import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import json
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from scipy.stats import zscore
# 1A...reading the csv file to a dataframe
Car_df = pd.read_csv("Car name.csv")
Car_df
| | car_name |
|---|---|
| 0 | chevrolet chevelle malibu |
| 1 | buick skylark 320 |
| 2 | plymouth satellite |
| 3 | amc rebel sst |
| 4 | ford torino |
| ... | ... |
| 393 | ford mustang gl |
| 394 | vw pickup |
| 395 | dodge rampage |
| 396 | ford ranger |
| 397 | chevy s-10 |
398 rows × 1 columns
# 1B...reading attributes json file into a dataframe
Car_attr_df = pd.read_json("Car-Attributes.json")
Car_attr_df
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 |
398 rows × 8 columns
print("Car name shape",Car_df.shape)
print("Car attr shape",Car_attr_df.shape)
Car name shape (398, 1)
Car attr shape (398, 8)
# 1C...Merging both dataframes to form a single dataframe
Car_final = pd.concat([Car_df, Car_attr_df], axis= 1)
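`pd.concat(axis=1)` glues the frames column-wise by aligning on the index, so this merge only works because both files share the same row order. A minimal sketch with toy frames (the values are illustrative, not from the dataset):

```python
import pandas as pd

# pd.concat(axis=1) aligns on the index, so both frames must share row order
names = pd.DataFrame({"car_name": ["ford torino", "vw pickup"]})
attrs = pd.DataFrame({"mpg": [17.0, 44.0], "cyl": [8, 4]})
merged = pd.concat([names, attrs], axis=1)
print(merged.shape)
```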
# 1D... 5 point summary
Car_final.shape
(398, 9)
Car_final
| | car_name | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 |
| 394 | vw pickup | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 |
| 395 | dodge rampage | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 |
| 396 | ford ranger | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 |
| 397 | chevy s-10 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 |
398 rows × 9 columns
Car_final.dtypes
car_name     object
mpg         float64
cyl           int64
disp        float64
hp           object
wt            int64
acc         float64
yr            int64
origin        int64
dtype: object
Car_final.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   car_name  398 non-null    object
 1   mpg       398 non-null    float64
 2   cyl       398 non-null    int64
 3   disp      398 non-null    float64
 4   hp        398 non-null    object
 5   wt        398 non-null    int64
 6   acc       398 non-null    float64
 7   yr        398 non-null    int64
 8   origin    398 non-null    int64
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
Car_final.describe()
| | mpg | cyl | disp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|
| count | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 |
| mean | 23.514573 | 5.454774 | 193.425879 | 2970.424623 | 15.568090 | 76.010050 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.000000 | 3.000000 | 68.000000 | 1613.000000 | 8.000000 | 70.000000 | 1.000000 |
| 25% | 17.500000 | 4.000000 | 104.250000 | 2223.750000 | 13.825000 | 73.000000 | 1.000000 |
| 50% | 23.000000 | 4.000000 | 148.500000 | 2803.500000 | 15.500000 | 76.000000 | 1.000000 |
| 75% | 29.000000 | 8.000000 | 262.000000 | 3608.000000 | 17.175000 | 79.000000 | 2.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 5140.000000 | 24.800000 | 82.000000 | 3.000000 |
Car_final.corr()
| | mpg | cyl | disp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|
| mpg | 1.000000 | -0.775396 | -0.804203 | -0.831741 | 0.420289 | 0.579267 | 0.563450 |
| cyl | -0.775396 | 1.000000 | 0.950721 | 0.896017 | -0.505419 | -0.348746 | -0.562543 |
| disp | -0.804203 | 0.950721 | 1.000000 | 0.932824 | -0.543684 | -0.370164 | -0.609409 |
| wt | -0.831741 | 0.896017 | 0.932824 | 1.000000 | -0.417457 | -0.306564 | -0.581024 |
| acc | 0.420289 | -0.505419 | -0.543684 | -0.417457 | 1.000000 | 0.288137 | 0.205873 |
| yr | 0.579267 | -0.348746 | -0.370164 | -0.306564 | 0.288137 | 1.000000 | 0.180662 |
| origin | 0.563450 | -0.562543 | -0.609409 | -0.581024 | 0.205873 | 0.180662 | 1.000000 |
sns.set(rc={'figure.figsize':(15.7,8)})
sns.set(style="ticks", color_codes=True)
sns.heatmap(Car_final.corr(), annot=True, linewidths=0.5, center=0, cbar=False, cmap="YlGnBu")
<AxesSubplot:>
Miles per gallon (mpg) is the dependent variable; all others are independent variables.
hp is also numeric in nature but is missing from describe() because it is stored as object dtype, which suggests it contains non-numeric entries (confirmed later: '?' placeholders).
Displacement and cylinders are highly positively correlated ....... 0.95
Weight and cylinders are highly positively correlated ............. 0.90
Weight and displacement are highly positively correlated .......... 0.93
Weight and miles per gallon are negatively correlated ............. -0.83
Displacement and miles per gallon are negatively correlated ....... -0.80
Cylinders and miles per gallon are negatively correlated .......... -0.78
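The strongest pairs can also be pulled out of the correlation matrix programmatically instead of read off the heatmap. A small sketch on toy numbers shaped like the dataset's wt / disp / mpg relationship:

```python
import pandas as pd

# toy numbers mimicking the dataset's wt / disp / mpg relationship
df = pd.DataFrame({"wt":   [1613, 2200, 3600, 5140],
                   "disp": [68, 105, 250, 455],
                   "mpg":  [43, 31, 18, 10]})
corr = df.corr()
pairs = corr.stack()
# keep each unordered pair once and drop the diagonal self-correlations
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
strongest = pairs.abs().sort_values(ascending=False)
print(strongest)
```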
car_percent_missing = Car_final.isnull().sum() * 100 / len(Car_final)
car_missing_value_df = pd.DataFrame({'column_name': Car_final.columns,
                                     'percent_missing': car_percent_missing})
car_missing_value_df
| | column_name | percent_missing |
|---|---|---|
| car_name | car_name | 0.0 |
| mpg | mpg | 0.0 |
| cyl | cyl | 0.0 |
| disp | disp | 0.0 |
| hp | hp | 0.0 |
| wt | wt | 0.0 |
| acc | acc | 0.0 |
| yr | yr | 0.0 |
| origin | origin | 0.0 |
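A caveat on the 0% figures above: `isnull()` only catches real NaN/None values, so sentinel strings such as '?' slip through unnoticed (exactly what happens with hp later). A toy sketch of the difference:

```python
import pandas as pd

# a toy frame with a '?' sentinel that isnull() cannot see
df = pd.DataFrame({"hp": ["130", "?", "150", "?"],
                   "mpg": [18.0, 15.0, None, 16.0]})
print(df.isnull().sum())   # finds only the real NaN in mpg
print((df == "?").sum())   # finds the string sentinels in hp
```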
for i in Car_final.columns:
    if i == 'car_name':
        print("")
    else:
        print("column name", i)
        print("")
        print("Unique values", Car_final[i].unique())
        print("")
        print("null values", Car_final[i].isnull().sum())
        print("na values", Car_final[i].isna().sum())
        print("% S unique values", Car_final[i].value_counts())
        print("")
column name mpg
Unique values [18. 15. 16. 17. 14. 24. 22. 21. 27. 26. 25. 10. 11. 9.
28. 19. 12. 13. 23. 30. 31. 35. 20. 29. 32. 33. 17.5 15.5
14.5 22.5 24.5 18.5 29.5 26.5 16.5 31.5 36. 25.5 33.5 20.5 30.5 21.5
43.1 36.1 32.8 39.4 19.9 19.4 20.2 19.2 25.1 20.6 20.8 18.6 18.1 17.7
27.5 27.2 30.9 21.1 23.2 23.8 23.9 20.3 21.6 16.2 19.8 22.3 17.6 18.2
16.9 31.9 34.1 35.7 27.4 25.4 34.2 34.5 31.8 37.3 28.4 28.8 26.8 41.5
38.1 32.1 37.2 26.4 24.3 19.1 34.3 29.8 31.3 37. 32.2 46.6 27.9 40.8
44.3 43.4 36.4 44.6 40.9 33.8 32.7 23.7 23.6 32.4 26.6 25.8 23.5 39.1
39. 35.1 32.3 37.7 34.7 34.4 29.9 33.7 32.9 31.6 28.1 30.7 24.2 22.4
34. 38. 44. ]
null values 0
na values 0
% S unique values 13.0 20
14.0 19
18.0 17
15.0 16
26.0 14
..
26.5 1
19.1 1
33.8 1
28.1 1
31.8 1
Name: mpg, Length: 129, dtype: int64
column name cyl
Unique values [8 4 6 3 5]
null values 0
na values 0
% S unique values 4 204
8 103
6 84
3 4
5 3
Name: cyl, dtype: int64
column name disp
Unique values [307. 350. 318. 304. 302. 429. 454. 440. 455. 390. 383. 340.
400. 113. 198. 199. 200. 97. 110. 107. 104. 121. 360. 140.
98. 232. 225. 250. 351. 258. 122. 116. 79. 88. 71. 72.
91. 97.5 70. 120. 96. 108. 155. 68. 114. 156. 76. 83.
90. 231. 262. 134. 119. 171. 115. 101. 305. 85. 130. 168.
111. 260. 151. 146. 80. 78. 105. 131. 163. 89. 267. 86.
183. 141. 173. 135. 81. 100. 145. 112. 181. 144. ]
null values 0
na values 0
% S unique values 97.0 21
350.0 18
98.0 18
318.0 17
250.0 17
..
83.0 1
181.0 1
81.0 1
96.0 1
144.0 1
Name: disp, Length: 82, dtype: int64
column name hp
Unique values [130 165 150 140 198 220 215 225 190 170 160 95 97 85 88 46 87 90 113 200
210 193 '?' 100 105 175 153 180 110 72 86 70 76 65 69 60 80 54 208 155
112 92 145 137 158 167 94 107 230 49 75 91 122 67 83 78 52 61 93 148 129
96 71 98 115 53 81 79 120 152 102 108 68 58 149 89 63 48 66 139 103 125
133 138 135 142 77 62 132 84 64 74 116 82]
null values 0
na values 0
% S unique values 150 22
90 20
88 19
110 18
100 17
..
64 1
94 1
158 1
135 1
102 1
Name: hp, Length: 94, dtype: int64
column name wt
Unique values [3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 3563 3609 3761 3086
2372 2833 2774 2587 2130 1835 2672 2430 2375 2234 2648 4615 4376 4382
4732 2264 2228 2046 2634 3439 3329 3302 3288 4209 4464 4154 4096 4955
4746 5140 2962 2408 3282 3139 2220 2123 2074 2065 1773 1613 1834 1955
2278 2126 2254 2226 4274 4385 4135 4129 3672 4633 4502 4456 4422 2330
3892 4098 4294 4077 2933 2511 2979 2189 2395 2288 2506 2164 2100 4100
3988 4042 3777 4952 4363 4237 4735 4951 3821 3121 3278 2945 3021 2904
1950 4997 4906 4654 4499 2789 2279 2401 2379 2124 2310 2472 2265 4082
4278 1867 2158 2582 2868 3399 2660 2807 3664 3102 2875 2901 3336 2451
1836 2542 3781 3632 3613 4141 4699 4457 4638 4257 2219 1963 2300 1649
2003 2125 2108 2246 2489 2391 2000 3264 3459 3432 3158 4668 4440 4498
4657 3907 3897 3730 3785 3039 3221 3169 2171 2639 2914 2592 2702 2223
2545 2984 1937 3211 2694 2957 2671 1795 2464 2572 2255 2202 4215 4190
3962 3233 3353 3012 3085 2035 3651 3574 3645 3193 1825 1990 2155 2565
3150 3940 3270 2930 3820 4380 4055 3870 3755 2045 1945 3880 4060 4140
4295 3520 3425 3630 3525 4220 4165 4325 4335 1940 2740 2755 2051 2075
1985 2190 2815 2600 2720 1800 2070 3365 3735 3570 3535 3155 2965 3430
3210 3380 3070 3620 3410 3445 3205 4080 2560 2230 2515 2745 2855 2405
2830 3140 2795 2135 3245 2990 2890 3265 3360 3840 3725 3955 3830 4360
4054 3605 1925 1975 1915 2670 3530 3900 3190 3420 2200 2150 2020 2595
2700 2556 2144 1968 2120 2019 2678 2870 3003 3381 2188 2711 2434 2110
2800 2085 2335 2950 3250 1850 2145 1845 2910 2420 2500 2905 2290 2490
2635 2620 2725 2385 1755 1875 1760 2050 2215 2380 2320 2210 2350 2615
3230 3160 2900 3415 3060 3465 2605 2640 2575 2525 2735 2865 3035 1980
2025 1970 2160 2205 2245 1965 1995 3015 2585 2835 2665 2370 2790 2295
2625]
null values 0
na values 0
% S unique values 1985 4
2130 4
2720 3
2125 3
2300 3
..
3761 1
2223 1
2735 1
3245 1
2145 1
Name: wt, Length: 351, dtype: int64
column name acc
Unique values [12. 11.5 11. 10.5 10. 9. 8.5 8. 9.5 15. 15.5 16. 14.5 20.5
17.5 12.5 14. 13.5 18.5 19. 13. 19.5 18. 17. 23.5 16.5 21. 16.9
14.9 17.7 15.3 13.9 12.8 15.4 17.6 22.2 22.1 14.2 17.4 16.2 17.8 12.2
16.4 13.6 15.7 13.2 21.9 16.7 12.1 14.8 18.6 16.8 13.7 11.1 11.4 18.2
15.8 15.9 14.1 21.5 14.4 19.4 19.2 17.2 18.7 15.1 13.4 11.2 14.7 16.6
17.3 15.2 14.3 20.1 24.8 11.3 12.9 18.8 18.1 17.9 21.7 23.7 19.9 21.8
13.8 12.6 16.1 20.7 18.3 20.4 19.6 17.1 15.6 24.6 11.6]
null values 0
na values 0
% S unique values 14.5 23
15.5 21
14.0 16
16.0 16
13.5 15
..
12.1 1
24.6 1
19.9 1
20.7 1
18.3 1
Name: acc, Length: 95, dtype: int64
column name yr
Unique values [70 71 72 73 74 75 76 77 78 79 80 81 82]
null values 0
na values 0
% S unique values 73 40
78 36
76 34
82 31
75 30
70 29
79 29
80 29
81 29
71 28
72 28
77 28
74 27
Name: yr, dtype: int64
column name origin
Unique values [1 3 2]
null values 0
na values 0
% S unique values 1 249
3 79
2 70
Name: origin, dtype: int64
Car_final.describe(include=object)
| | car_name | hp |
|---|---|---|
| count | 398 | 398 |
| unique | 305 | 94 |
| top | ford pinto | 150 |
| freq | 6 | 22 |
# 2B..... Checking for Duplicated rows in the dataset
Car_final.duplicated().sum()
0
#There are no duplicated rows in the dataset
# 2C......... Pairplot for all features
sns.set(rc={'figure.figsize':(18.7,12.27)})
sns.set(style="ticks", color_codes=True)
sns.pairplot(Car_final);
# 2D......... Scatterplot with Weight and displacement
sns.set(rc={'figure.figsize':(10.7,10.27)})
g = sns.scatterplot(x="wt", y="disp",
                    hue="cyl",
                    data=Car_final,
                    palette=['green','orange','brown','dodgerblue','red'], legend='full')
sns.set(rc={'figure.figsize':(30.7,30.27)})
cols = Car_final[['wt','disp','cyl']]
sns.pairplot(cols, hue='cyl')
<seaborn.axisgrid.PairGrid at 0x244e3b0e6a0>
# 2E....Insights on the scatterplot between weight and displacement
Weight and displacement look positively correlated: as weight increases there is a proportionate increase in displacement, and the cylinder count rises along with both.
As found earlier in the correlation heatmap, weight and displacement have a 0.93 correlation, which is quite close to 1.
Smaller cylinder counts tend to appear in lighter cars, while higher cylinder counts appear in heavier cars.
There are only a handful of 3- and 5-cylinder cars, while 4-, 6- and 8-cylinder cars are plentiful.
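The "proportionate increase within each cylinder group" claim can be checked numerically by computing the wt/disp correlation per cyl group. A sketch on toy data (the numbers are illustrative):

```python
import pandas as pd

# toy data: the wt/disp relationship inspected separately within each cyl group
df = pd.DataFrame({"cyl":  [4, 4, 4, 8, 8, 8],
                   "wt":   [1800, 2200, 2600, 3600, 4200, 5100],
                   "disp": [80, 100, 120, 300, 360, 450]})
per_cyl = df.groupby("cyl").apply(lambda g: g["wt"].corr(g["disp"]))
print(per_cyl)
```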
# 2F... Scatterplot with weight and miles per gallon
sns.set(rc={'figure.figsize':(10.7,10.27)})
g = sns.scatterplot(x="wt", y="mpg",
                    hue="cyl",
                    data=Car_final,
                    palette=['green','orange','brown','dodgerblue','red'], legend='full')
!pip install plotly
import plotly.express as px
fig = px.scatter_3d(Car_final,x="cyl", y="disp", z="wt", color="mpg")
fig.show()
# 2G.... Insights on weight and miles per gallon
# Weight and miles per gallon are negatively correlated: as the weight of the car increases, fuel efficiency decreases.
# The correlation is -0.83, not as strong as weight vs. displacement but still substantial.
# Lighter cars are more fuel efficient (higher mpg), while heavier cars deliver fewer miles per gallon.
# Heavier cars also tend to have more cylinders.
# 2H........ Checking for Unexpected values in the dataset
Car_final.loc[Car_final["hp"] == '?'].value_counts()
car_name              mpg   cyl  disp   hp  wt    acc   yr  origin
amc concord dl        23.0  4    151.0  ?   3035  20.5  82  1         1
ford maverick         21.0  6    200.0  ?   2875  17.0  74  1         1
ford mustang cobra    23.6  4    140.0  ?   2905  14.3  80  1         1
ford pinto            25.0  4    98.0   ?   2046  19.0  71  1         1
renault 18i           34.5  4    100.0  ?   2320  15.8  81  2         1
renault lecar deluxe  40.9  4    85.0   ?   1835  17.3  80  2         1
dtype: int64
# 6 rows contain '?'; the impute approach is to replace them with the column median
Car_final['hp'] = pd.to_numeric(Car_final['hp'], errors='coerce')    # '?' -> NaN, dtype -> float
Car_final['hp'] = Car_final['hp'].fillna(Car_final['hp'].median())   # median imputation
Car_final.describe(include=object)
| | car_name |
|---|---|
| count | 398 |
| unique | 305 |
| top | ford pinto |
| freq | 6 |
Car_final.dtypes
car_name     object
mpg         float64
cyl           int64
disp        float64
hp          float64
wt            int64
acc         float64
yr            int64
origin        int64
dtype: object
Car_final.describe()
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.304020 | 2970.424623 | 15.568090 | 76.010050 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.222625 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 | 70.000000 | 1.000000 |
| 25% | 17.500000 | 4.000000 | 104.250000 | 76.000000 | 2223.750000 | 13.825000 | 73.000000 | 1.000000 |
| 50% | 23.000000 | 4.000000 | 148.500000 | 93.500000 | 2803.500000 | 15.500000 | 76.000000 | 1.000000 |
| 75% | 29.000000 | 8.000000 | 262.000000 | 125.000000 | 3608.000000 | 17.175000 | 79.000000 | 2.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 82.000000 | 3.000000 |
sns.histplot(Car_final['hp'], kde=True)   # distplot is deprecated; histplot is its modern replacement
<AxesSubplot:xlabel='hp', ylabel='Count'>
f = plt.figure()
f.set_figwidth(25)
f.set_figheight(10)
pd.value_counts(Car_final["hp"]).plot(kind="bar")
plt.show()
g = sns.PairGrid(Car_final, diag_sharey=False)
g.map_lower(sns.scatterplot, alpha=0.3, edgecolor='none')
g.map_diag(sns.kdeplot)
g.map_upper(sns.kdeplot)
<seaborn.axisgrid.PairGrid at 0x1ee67b05a90>
# This pairplot will include hp also.
sns.set(rc={'figure.figsize':(18.7,12.27)})
sns.set(style="ticks", color_codes=True)
sns.pairplot(Car_final, diag_kind='kde');
#........ Re-visualizing the correlation heatmap, now including hp
sns.set(rc={'figure.figsize':(15.7,8)})
sns.set(style="ticks", color_codes=True)
sns.heatmap(Car_final.corr(), annot=True, linewidths=0.5, center=0, cbar=False, cmap="YlGnBu")
<AxesSubplot:>
# .... hp and displacement have a positive correlation ... 0.90
# .... hp and weight have a positive correlation ......... 0.86
# .... hp and cylinders have a positive correlation ...... 0.84
Car_With_noName=Car_final.iloc[:,1:] # ....... Excluding car name from the dataset
Car_Scaled=Car_With_noName.apply(zscore) #..... applying Z score to scale the data since Wt has huge numbers
Car_Scaled
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | -1.627426 | -0.715145 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | -1.627426 | -0.715145 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | -1.627426 | -0.715145 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | -1.627426 | -0.715145 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | -1.627426 | -0.715145 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 0.446497 | -0.856321 | -0.513026 | -0.479482 | -0.213324 | 0.011586 | 1.621983 | -0.715145 |
| 394 | 2.624265 | -0.856321 | -0.925936 | -1.370127 | -0.993671 | 3.279296 | 1.621983 | 0.533222 |
| 395 | 1.087017 | -0.856321 | -0.561039 | -0.531873 | -0.798585 | -1.440730 | 1.621983 | -0.715145 |
| 396 | 0.574601 | -0.856321 | -0.705077 | -0.662850 | -0.408411 | 1.100822 | 1.621983 | -0.715145 |
| 397 | 0.958913 | -0.856321 | -0.714680 | -0.584264 | -0.296088 | 1.391285 | 1.621983 | -0.715145 |
398 rows × 8 columns
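For reference, `scipy.stats.zscore` standardizes each column with the population standard deviation (ddof=0); a quick numpy-only sketch of the same formula on toy, wt-like magnitudes:

```python
import numpy as np

# z-score by hand: (x - mean) / population std, the same convention zscore uses
x = np.array([1613.0, 2970.0, 5140.0])   # wt-like magnitudes (toy values)
z = (x - x.mean()) / x.std(ddof=0)
print(z)
```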
# 3A.... Applying K-means clustering for a range of 2 to 10 clusters
k_range = range(2, 10)
errors = []
for k in k_range:
    clusters = KMeans(n_clusters=k, n_init=5)
    clusters.fit(Car_Scaled)
    errors.append(clusters.inertia_)   # within-cluster sum of squared distances for this k
clusters_df = pd.DataFrame({"k": k_range, "cluster_errors": errors})
#clusters_df[0:15]
clusters_df
| | k | cluster_errors |
|---|---|---|
| 0 | 2 | 1588.592457 |
| 1 | 3 | 1190.043653 |
| 2 | 4 | 987.943167 |
| 3 | 5 | 829.715787 |
| 4 | 6 | 760.185462 |
| 5 | 7 | 683.162012 |
| 6 | 8 | 649.731082 |
| 7 | 9 | 602.872055 |
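One way to make the elbow less subjective is to look at the percentage drop in inertia from one k to the next; a sketch using the inertia values from the table above (rounded to two decimals):

```python
# inertia values reported in the table above, for k = 2 .. 9
errors = [1588.59, 1190.04, 987.94, 829.72, 760.19, 683.16, 649.73, 602.87]
# percentage improvement gained by moving to the next k
drops = [(a - b) / a * 100 for a, b in zip(errors, errors[1:])]
print([round(d, 1) for d in drops])   # the gains flatten out after k = 5
```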
# 3B.......... Plotting the curve and finding the elbow
plt.figure(figsize=(12,6))
plt.plot(clusters_df.k, clusters_df.cluster_errors, marker="o")
plt.xlabel('k clusters')
plt.ylabel('Within-cluster SSE (inertia)')
plt.vlines(x=5, ymin=min(clusters_df.cluster_errors), ymax=max(clusters_df.cluster_errors), colors='red', ls=':', lw=2, label='k = 5')
plt.vlines(x=7, ymin=min(clusters_df.cluster_errors), ymax=max(clusters_df.cluster_errors), colors='red', ls=':', lw=2, label='k = 7')
plt.title('Selecting k with the Elbow Method')
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
!pip install yellowbrick
Collecting yellowbrick
  Downloading yellowbrick-1.4-py3-none-any.whl (274 kB)
Successfully installed yellowbrick-1.4
from yellowbrick.cluster import KElbowVisualizer
visualizer = KElbowVisualizer(KMeans(n_init=5, random_state=1), k=(2,10))   # pass a fresh estimator rather than the last fitted one
visualizer.fit(Car_Scaled)    # fit the data to the visualizer
visualizer.show()             # finalize and render the figure
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
# 3C....... k = 5 and k = 7 both look like reasonable elbow points, so 5 or 7 clusters could be good choices
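When the elbow is ambiguous like this, the silhouette score (closer to 1 means better-separated clusters) is a common tie-breaker. A self-contained sketch on synthetic blobs, not on this dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# two well-separated synthetic blobs; silhouette rewards tight, distant clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
               rng.normal(5.0, 0.2, (50, 2))])
km = KMeans(n_clusters=2, n_init=5, random_state=1).fit(X)
score = silhouette_score(X, km.labels_)
print(round(score, 3))
```

Running the same comparison for k = 5 vs k = 7 on Car_Scaled would show which candidate separates the cars more cleanly.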
# 3D...... Training with the optimal number of clusters - 5
kmeans = KMeans(n_clusters=5, n_init = 5, random_state=1)
kmeans.fit(Car_Scaled)
KMeans(n_clusters=5, n_init=5, random_state=1)
# Check the number of data in each cluster ............ checking with 5 clusters
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
[81 94 67 72 84]
# Distribution looks fine.............................. checking with 5 clusters
# let us check the centers in each group
centroids = kmeans.cluster_centers_
centroid_df = pd.DataFrame(centroids, columns = list(Car_Scaled) )
centroid_df.transpose()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mpg | -0.564892 | -1.163797 | 0.675937 | 1.360839 | 0.141488 |
| cyl | 0.451742 | 1.498191 | -0.742109 | -0.807268 | -0.828291 |
| disp | 0.384281 | 1.503923 | -0.584400 | -0.901663 | -0.814535 |
| hp | -0.063427 | 1.521683 | -0.561587 | -0.823297 | -0.488058 |
| wt | 0.387993 | 1.404098 | -0.480763 | -0.963144 | -0.736371 |
| acc | 0.386319 | -1.086149 | 0.356240 | 0.423579 | 0.195719 |
| yr | -0.082954 | -0.688324 | 0.991501 | 0.922458 | -0.731260 |
| origin | -0.668909 | -0.715145 | -0.621983 | 1.486835 | 0.666976 |
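The centroid values above are in z-score units; they map back to original units via x = z * std + mean. A sketch using the wt mean/std from describe() above (describe() reports the sample std while zscore used ddof=0, so the result is approximate):

```python
# map a z-scored centroid back to original units: x = z * std + mean
wt_mean, wt_std = 2970.42, 846.84   # wt mean/std from describe() (sample std, so approximate)
z_wt = 1.404098                     # cluster 1's scaled wt centroid
wt_original = z_wt * wt_std + wt_mean
print(round(wt_original))           # close to the heavy-car cluster's mean weight
```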
pred=kmeans.predict(Car_Scaled)
pred
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 0, 0, 0, 4, 4, 4, 4,
4, 4, 0, 1, 1, 1, 1, 4, 4, 4, 4, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
1, 0, 4, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1,
1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 4, 1, 1, 1, 1, 0, 4, 4,
4, 4, 4, 0, 4, 1, 1, 4, 4, 4, 4, 1, 4, 4, 1, 0, 0, 0, 0, 3, 4, 3,
4, 0, 0, 0, 1, 1, 1, 1, 1, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 4, 0, 0,
0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 4, 2, 0, 2, 4, 4, 4, 0, 4,
0, 4, 4, 4, 4, 3, 4, 4, 2, 2, 4, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 4,
3, 0, 0, 0, 0, 4, 3, 3, 2, 4, 1, 4, 4, 0, 1, 1, 1, 1, 3, 2, 3, 2,
3, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 4, 2, 3, 2, 2, 2, 3, 4, 4,
4, 4, 3, 2, 3, 3, 3, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0,
1, 1, 2, 3, 3, 2, 4, 2, 2, 3, 4, 0, 4, 0, 3, 3, 0, 0, 2, 0, 0, 0,
1, 1, 0, 1, 1, 0, 1, 3, 3, 2, 2, 2, 0, 2, 0, 2, 2, 3, 3, 2, 2, 2,
2, 3, 3, 2, 3, 2, 2, 2, 0, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3,
3, 3, 3, 3, 3, 3, 2, 3, 2, 2, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 2,
2, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 0, 3, 2, 2, 2, 3, 2,
2, 2])
# 3E.......Appending the prediction ......Adding the new feature which will have labels based upon cluster value
Car_final["Category"] = pred
Car_Scaled["Category"] = pred
print("Categories Assigned : \n")
Car_final
Categories Assigned :
| | car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | 1 |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | 1 |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | 1 |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | 1 |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | 1 | 2 |
| 394 | vw pickup | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 2 | 3 |
| 395 | dodge rampage | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | 1 | 2 |
| 396 | ford ranger | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | 1 | 2 |
| 397 | chevy s-10 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | 1 | 2 |
398 rows × 10 columns
Car_Clust = Car_final.groupby(['Category'])
Car_Clust.mean()
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| Category | ||||||||
| 0 | 19.104938 | 6.222222 | 233.444444 | 101.882716 | 3298.580247 | 16.632099 | 75.703704 | 1.037037 |
| 1 | 14.429787 | 8.000000 | 350.042553 | 162.393617 | 4157.978723 | 12.576596 | 73.468085 | 1.000000 |
| 2 | 28.791045 | 4.194030 | 132.567164 | 82.865672 | 2563.805970 | 16.549254 | 79.671642 | 1.074627 |
| 3 | 34.137500 | 4.083333 | 99.527778 | 72.875000 | 2155.819444 | 16.734722 | 79.416667 | 2.763889 |
| 4 | 24.619048 | 4.047619 | 108.601190 | 85.672619 | 2347.619048 | 16.107143 | 73.309524 | 2.107143 |
# Inference for the 5-cluster model
1. Cluster 3 has the lowest horsepower, weight and displacement and few cylinders, yet gives the highest miles per gallon
2. Cluster 1 has the highest horsepower, weight, displacement and cylinder count, and gives the lowest miles per gallon
3. Clusters 2 and 4 are broadly similar to cluster 3, with slight differences in features, and give good miles per gallon
4. Cluster 0 is roughly comparable to cluster 1, with slight variations
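Once clusters have been interpreted like this, the numeric labels can be mapped to readable names with a plain dict. The names below are hypothetical, purely illustrative summaries of the descriptions above:

```python
import pandas as pd

# hypothetical, illustrative names for the five clusters described above
cluster_names = {0: "mid-size American", 1: "heavy V8",
                 2: "efficient late-70s", 3: "light import", 4: "light early-70s"}
cats = pd.Series([1, 3, 0])                 # toy sample of Category labels
print(cats.map(cluster_names).tolist())
```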
# 3F.........Plot a visual and color the datapoints based upon clusters.
Car_box = Car_Scaled.boxplot(by='Category', layout = (2,4),figsize=(15,10), patch_artist=True,boxprops=dict(facecolor='r'))
# 3D...... Re-training with the second-choice number of clusters - 7
Car_Scaled = Car_Scaled.drop(columns="Category")   # drop the 5-cluster label so it does not leak into the refit
kmeans = KMeans(n_clusters=7, n_init=5, random_state=1)
kmeans.fit(Car_Scaled)
KMeans(n_clusters=7, n_init=5, random_state=1)
# Check the number of data in each cluster ............ checking with 7 clusters
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
[67 51 41 65 77 47 50]
# Distribution looks fine.............................. checking with 7 clusters
# let us check the centers in each group
centroids = kmeans.cluster_centers_
centroid_df = pd.DataFrame(centroids, columns = list(Car_Scaled) )
centroid_df.transpose()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| mpg | 0.675937 | -1.010372 | 0.209036 | 1.467978 | -5.577050e-01 | -1.291083 | 0.117526 |
| cyl | -0.742109 | 1.475108 | -0.856321 | -0.847265 | 4.126696e-01 | 1.498191 | -0.750368 |
| disp | -0.584400 | 1.200186 | -0.847594 | -0.935244 | 3.609329e-01 | 1.776473 | -0.755971 |
| hp | -0.561587 | 1.046531 | -0.730894 | -0.922789 | -1.135962e-01 | 1.984562 | -0.206525 |
| wt | -0.480763 | 1.217104 | -0.855424 | -1.023794 | 3.645660e-01 | 1.558910 | -0.591651 |
| acc | 0.356240 | -0.748032 | 0.752798 | 0.552294 | 4.218177e-01 | -1.385882 | -0.396515 |
| yr | 0.991501 | -0.130149 | -0.973582 | 0.922111 | -1.258051e-01 | -1.172278 | -0.300584 |
| origin | -0.621983 | -0.715145 | 0.167846 | 1.455093 | -6.665071e-01 | -0.715145 | 1.232307 |
pred=kmeans.predict(Car_Scaled)
pred
array([1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 4, 4, 4, 6, 2, 2, 2,
2, 6, 4, 5, 5, 5, 5, 6, 2, 6, 2, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5,
5, 4, 2, 4, 4, 2, 6, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 5, 5, 1, 5,
1, 5, 5, 5, 5, 6, 1, 1, 1, 1, 6, 2, 2, 2, 2, 6, 6, 2, 6, 5, 1, 1,
1, 1, 5, 5, 5, 1, 5, 5, 5, 4, 4, 4, 4, 4, 2, 5, 5, 5, 5, 4, 2, 2,
6, 6, 2, 4, 2, 1, 5, 2, 2, 6, 6, 1, 6, 6, 5, 4, 4, 4, 4, 3, 2, 3,
2, 4, 4, 4, 1, 1, 1, 1, 1, 2, 2, 6, 3, 3, 2, 2, 6, 6, 6, 2, 4, 4,
4, 4, 5, 1, 1, 1, 4, 4, 4, 4, 4, 4, 1, 6, 0, 4, 0, 6, 2, 6, 4, 6,
4, 6, 2, 6, 6, 3, 6, 2, 0, 0, 6, 1, 1, 1, 1, 4, 4, 4, 4, 0, 0, 6,
3, 4, 4, 4, 4, 6, 3, 3, 0, 6, 1, 2, 6, 4, 1, 1, 1, 1, 3, 0, 3, 0,
3, 1, 4, 1, 1, 4, 4, 4, 4, 5, 1, 5, 1, 6, 0, 3, 0, 0, 0, 3, 6, 6,
6, 6, 3, 0, 3, 3, 3, 4, 1, 1, 4, 4, 4, 0, 4, 4, 4, 4, 4, 4, 1, 1,
1, 1, 0, 6, 6, 0, 6, 0, 0, 6, 6, 4, 6, 4, 3, 3, 4, 4, 0, 4, 4, 1,
1, 1, 1, 1, 1, 1, 1, 3, 3, 0, 0, 0, 4, 0, 4, 0, 0, 3, 3, 0, 0, 0,
0, 3, 3, 0, 3, 0, 0, 0, 4, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 0, 3,
3, 3, 3, 6, 6, 3, 0, 3, 0, 0, 0, 0, 0, 3, 0, 3, 3, 3, 3, 3, 0, 0,
0, 3, 3, 3, 3, 3, 3, 0, 0, 6, 6, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0,
0, 3, 3, 3, 0, 0, 3, 3, 3, 3, 3, 3, 0, 0, 0, 4, 3, 0, 0, 0, 3, 0,
0, 0])
Car_final
| | car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | 1 |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | 1 |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | 1 |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | 1 |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | 1 | 2 |
| 394 | vw pickup | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 2 | 3 |
| 395 | dodge rampage | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | 1 | 2 |
| 396 | ford ranger | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | 1 | 2 |
| 397 | chevy s-10 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | 1 | 2 |
398 rows × 10 columns
# Appending the prediction
Car_final["Category"] = pred
Car_Scaled["Category"] = pred
print("Categories Assigned : \n")
Car_final
Categories Assigned :
| | car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | 1 |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | 5 |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | 5 |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | 5 |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | ford mustang gl | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | 1 | 0 |
| 394 | vw pickup | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 2 | 3 |
| 395 | dodge rampage | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | 1 | 0 |
| 396 | ford ranger | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | 1 | 0 |
| 397 | chevy s-10 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | 1 | 0 |
398 rows × 10 columns
# Cluster profiles: mean of each numeric feature per assigned cluster
Car_Clust = Car_final.groupby('Category')
Car_Clust.mean(numeric_only=True)  # numeric_only skips car_name (required in newer pandas)
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| Category | ||||||||
| 0 | 28.791045 | 4.194030 | 132.567164 | 82.865672 | 2563.805970 | 16.549254 | 79.671642 | 1.074627 |
| 1 | 15.627451 | 7.960784 | 318.411765 | 144.254902 | 3999.823529 | 13.507843 | 75.529412 | 1.000000 |
| 2 | 25.146341 | 4.000000 | 105.158537 | 76.402439 | 2246.926829 | 17.641463 | 72.414634 | 1.707317 |
| 3 | 34.973846 | 4.015385 | 96.030769 | 69.076923 | 2104.523077 | 17.089231 | 79.415385 | 2.738462 |
| 4 | 19.161039 | 6.155844 | 231.012987 | 99.967532 | 3278.766234 | 16.729870 | 75.545455 | 1.038961 |
| 5 | 13.436170 | 8.000000 | 378.425532 | 180.063830 | 4288.914894 | 11.751064 | 71.680851 | 1.000000 |
| 6 | 24.432000 | 4.180000 | 114.700000 | 96.420000 | 2470.020000 | 14.476000 | 74.900000 | 2.560000 |
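Before reading the per-cluster means, it is worth checking how many cars land in each cluster, since averages over a tiny cluster are unreliable. A minimal sketch with made-up labels standing in for `Car_final["Category"]`:

```python
import pandas as pd

# Hypothetical cluster labels standing in for Car_final["Category"]
labels = pd.Series([0, 0, 1, 2, 2, 2, 1, 0, 3, 3])

# value_counts gives the size of each cluster, largest first
sizes = labels.value_counts()
print(sizes)
```

On the notebook data this is simply `Car_final['Category'].value_counts()`.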
# Inference for the 7-cluster model
1. Cluster 3 has the lowest horsepower, weight, and displacement and the fewest cylinders, yet gives the highest miles per gallon
2. Cluster 5 has the highest horsepower, weight, displacement, and cylinder count, and gives the lowest miles per gallon
3. Cluster 1 is very similar to Cluster 5, with only slight differences across the features
4. Clusters 2 and 6 are also close to each other, with minor differences in features
5. Cluster 0 also gives good mileage with fewer cylinders
Car_box = Car_Scaled.boxplot(by='Category', layout = (2,4),figsize=(15,10), patch_artist=True,boxprops=dict(facecolor='r'))
#... 3G...... Passing a new data point and finding out which cluster it is assigned to
#............ Pulling the record for car name = "ford torino" from the existing dataset and treating it as a new data point
#............ This record was originally assigned to cluster 5
New_data = Car_final[Car_final['car_name'] == 'ford torino']
New_data
| car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | Category | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | 5 |
Car_new = New_data.drop(['car_name'], axis=1)
Car_new
| mpg | cyl | disp | hp | wt | acc | yr | origin | Category | |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | 5 |
New_pred = kmeans.predict(Car_new)
New_pred #...... predict correctly assigns the new data point to Cluster 5
array([5])
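One caveat on the check above: the k-means model was fit on the scaled features, so a genuinely new raw record would need the same scaling applied (and no appended `Category` column) before calling `predict`. A sketch with synthetic data, where the feature matrix and `scaler` are stand-ins for whatever produced `Car_Scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for the 8 numeric car attributes (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

scaler = StandardScaler().fit(X)  # remember the training mean/std
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(X))

# A new raw record must get the SAME scaling before predict,
# and must carry only the original features (no appended Category column)
new_point = X[4:5]                # pretend this row arrived as new data
pred = km.predict(scaler.transform(new_point))
print(pred)
```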
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Use silhouette score to find optimal number of clusters to segment the data
num_clusters = np.arange(2,10)
results = {}
for size in num_clusters:
    # Car_Scaled should contain only the scaled features here; drop the appended Category column if it is present
    model = KMeans(n_clusters = size).fit(Car_Scaled)
    predictions = model.predict(Car_Scaled)
    results[size] = silhouette_score(Car_Scaled, predictions)
best_size = max(results, key=results.get)
results
{2: 0.3365376180833712,
3: 0.35982189814855536,
4: 0.4067080893430127,
5: 0.4429637051916042,
6: 0.5022359743463826,
7: 0.5153691119760141,
8: 0.4864490524283309,
9: 0.43352894036594447}
best_size
7
# .. The silhouette score is another way to choose the number of clusters,
# .. and here it also suggests 7 as the best size.
x, y = zip(*sorted(results.items()))  # sort by cluster count before plotting
plt.plot(x, y)
plt.show()
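KMeans starts from random centroids, so the scan above can return slightly different scores on each run. A reproducible variant on synthetic blob data (`make_blobs` standing in for `Car_Scaled`), pinning `random_state` and `n_init`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic blobs standing in for Car_Scaled (features only, no label column)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):
    # pinning random_state and n_init makes the scan repeatable run to run
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```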
#1A... Reading Vehicle csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Vehicle_df = pd.read_csv("vehicle.csv")
Vehicle_df
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | van |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | van |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | car |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | van |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | bus |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 841 | 93 | 39.0 | 87.0 | 183.0 | 64.0 | 8 | 169.0 | 40.0 | 20.0 | 134 | 200.0 | 422.0 | 149.0 | 72.0 | 7.0 | 25.0 | 188.0 | 195 | car |
| 842 | 89 | 46.0 | 84.0 | 163.0 | 66.0 | 11 | 159.0 | 43.0 | 20.0 | 159 | 173.0 | 368.0 | 176.0 | 72.0 | 1.0 | 20.0 | 186.0 | 197 | van |
| 843 | 106 | 54.0 | 101.0 | 222.0 | 67.0 | 12 | 222.0 | 30.0 | 25.0 | 173 | 228.0 | 721.0 | 200.0 | 70.0 | 3.0 | 4.0 | 187.0 | 201 | car |
| 844 | 86 | 36.0 | 78.0 | 146.0 | 58.0 | 7 | 135.0 | 50.0 | 18.0 | 124 | 155.0 | 270.0 | 148.0 | 66.0 | 0.0 | 25.0 | 190.0 | 195 | car |
| 845 | 85 | 36.0 | 66.0 | 123.0 | 55.0 | 5 | 120.0 | 56.0 | 17.0 | 128 | 140.0 | 212.0 | 131.0 | 73.0 | 1.0 | 18.0 | 186.0 | 190 | van |
846 rows × 19 columns
Vehicle_df.dtypes
compactness int64 circularity float64 distance_circularity float64 radius_ratio float64 pr.axis_aspect_ratio float64 max.length_aspect_ratio int64 scatter_ratio float64 elongatedness float64 pr.axis_rectangularity float64 max.length_rectangularity int64 scaled_variance float64 scaled_variance.1 float64 scaled_radius_of_gyration float64 scaled_radius_of_gyration.1 float64 skewness_about float64 skewness_about.1 float64 skewness_about.2 float64 hollows_ratio int64 class object dtype: object
Vehicle_df.describe()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 846.000000 | 841.000000 | 842.000000 | 840.000000 | 844.000000 | 846.000000 | 845.000000 | 845.000000 | 843.000000 | 846.000000 | 843.000000 | 844.000000 | 844.000000 | 842.000000 | 840.000000 | 845.000000 | 845.000000 | 846.000000 |
| mean | 93.678487 | 44.828775 | 82.110451 | 168.888095 | 61.678910 | 8.567376 | 168.901775 | 40.933728 | 20.582444 | 147.998818 | 188.631079 | 439.494076 | 174.709716 | 72.447743 | 6.364286 | 12.602367 | 188.919527 | 195.632388 |
| std | 8.234474 | 6.152172 | 15.778292 | 33.520198 | 7.891463 | 4.601217 | 33.214848 | 7.816186 | 2.592933 | 14.515652 | 31.411004 | 176.666903 | 32.584808 | 7.486190 | 4.920649 | 8.936081 | 6.155809 | 7.438797 |
| min | 73.000000 | 33.000000 | 40.000000 | 104.000000 | 47.000000 | 2.000000 | 112.000000 | 26.000000 | 17.000000 | 118.000000 | 130.000000 | 184.000000 | 109.000000 | 59.000000 | 0.000000 | 0.000000 | 176.000000 | 181.000000 |
| 25% | 87.000000 | 40.000000 | 70.000000 | 141.000000 | 57.000000 | 7.000000 | 147.000000 | 33.000000 | 19.000000 | 137.000000 | 167.000000 | 318.000000 | 149.000000 | 67.000000 | 2.000000 | 5.000000 | 184.000000 | 190.250000 |
| 50% | 93.000000 | 44.000000 | 80.000000 | 167.000000 | 61.000000 | 8.000000 | 157.000000 | 43.000000 | 20.000000 | 146.000000 | 179.000000 | 363.500000 | 173.500000 | 71.500000 | 6.000000 | 11.000000 | 188.000000 | 197.000000 |
| 75% | 100.000000 | 49.000000 | 98.000000 | 195.000000 | 65.000000 | 10.000000 | 198.000000 | 46.000000 | 23.000000 | 159.000000 | 217.000000 | 587.000000 | 198.000000 | 75.000000 | 9.000000 | 19.000000 | 193.000000 | 201.000000 |
| max | 119.000000 | 59.000000 | 112.000000 | 333.000000 | 138.000000 | 55.000000 | 265.000000 | 61.000000 | 29.000000 | 188.000000 | 320.000000 | 1018.000000 | 268.000000 | 135.000000 | 22.000000 | 41.000000 | 206.000000 | 211.000000 |
# 1B... Check percentage of missing values
There are 846 rows in total, but several columns report a lower count in describe() above, which means they are missing a few values,
while the remaining numeric columns show the full count of 846.
describe() covers 18 of the 19 columns; the one object column ("class") is summarized separately below.
Vehicle_df.describe(include=object)
| class | |
|---|---|
| count | 846 |
| unique | 3 |
| top | car |
| freq | 429 |
The class column doesn't seem to have any missing values.
Vehicle_df.isnull().any()
compactness False circularity True distance_circularity True radius_ratio True pr.axis_aspect_ratio True max.length_aspect_ratio False scatter_ratio True elongatedness True pr.axis_rectangularity True max.length_rectangularity False scaled_variance True scaled_variance.1 True scaled_radius_of_gyration True scaled_radius_of_gyration.1 True skewness_about True skewness_about.1 True skewness_about.2 True hollows_ratio False class False dtype: bool
Fourteen columns are listed as "True", which matches the 14 columns with reduced counts noted earlier.
for i in Vehicle_df.columns:
print("column is",i)
print("missing values",Vehicle_df[i].isnull().sum())
column is compactness missing values 0 column is circularity missing values 5 column is distance_circularity missing values 4 column is radius_ratio missing values 6 column is pr.axis_aspect_ratio missing values 2 column is max.length_aspect_ratio missing values 0 column is scatter_ratio missing values 1 column is elongatedness missing values 1 column is pr.axis_rectangularity missing values 3 column is max.length_rectangularity missing values 0 column is scaled_variance missing values 3 column is scaled_variance.1 missing values 2 column is scaled_radius_of_gyration missing values 2 column is scaled_radius_of_gyration.1 missing values 4 column is skewness_about missing values 6 column is skewness_about.1 missing values 1 column is skewness_about.2 missing values 1 column is hollows_ratio missing values 0 column is class missing values 0
These counts match the numbers reported earlier.
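The per-column loop above can be replaced by a single vectorized call. A sketch on a small frame with deliberate gaps standing in for `Vehicle_df`:

```python
import numpy as np
import pandas as pd

# Small frame with deliberate gaps, standing in for Vehicle_df
df = pd.DataFrame({
    "compactness": [95, 91, 104, 93],
    "circularity": [48.0, np.nan, 50.0, np.nan],
    "radius_ratio": [178.0, 141.0, np.nan, 159.0],
    "class": ["van", "van", "car", "van"],
})

# One vectorized call replaces the per-column loop
missing = df.isnull().sum()
print(missing[missing > 0])
```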
temp_vehicle = Vehicle_df[Vehicle_df.isna().any(axis=1)]
temp_vehicle
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 107 | NaN | 106.0 | 172.0 | 50.0 | 6 | 255.0 | 26.0 | 28.0 | 169 | 280.0 | 957.0 | 264.0 | 85.0 | 5.0 | 9.0 | 181.0 | 183 | bus |
| 9 | 93 | 44.0 | 98.0 | NaN | 62.0 | 11 | 183.0 | 36.0 | 22.0 | 146 | 202.0 | 505.0 | 152.0 | 64.0 | 4.0 | 14.0 | 195.0 | 204 | car |
| 19 | 101 | 56.0 | 100.0 | 215.0 | NaN | 10 | 208.0 | 32.0 | 24.0 | 169 | 227.0 | 651.0 | 223.0 | 74.0 | 6.0 | 5.0 | 186.0 | 193 | car |
| 35 | 100 | 46.0 | NaN | 172.0 | 67.0 | 9 | 157.0 | 43.0 | 20.0 | 150 | 170.0 | 363.0 | 184.0 | 67.0 | 17.0 | 7.0 | 192.0 | 200 | van |
| 66 | 81 | 43.0 | 68.0 | 125.0 | 57.0 | 8 | 149.0 | 46.0 | 19.0 | 146 | 169.0 | 323.0 | 172.0 | NaN | NaN | 18.0 | 179.0 | 184 | bus |
| 70 | 96 | 55.0 | 98.0 | 161.0 | 54.0 | 10 | 215.0 | 31.0 | NaN | 175 | 226.0 | 683.0 | 221.0 | 76.0 | 3.0 | 6.0 | 185.0 | 193 | car |
| 77 | 86 | 40.0 | 62.0 | 140.0 | 62.0 | 7 | 150.0 | 45.0 | 19.0 | 133 | 165.0 | 330.0 | 173.0 | NaN | 2.0 | 3.0 | 180.0 | 185 | car |
| 78 | 104 | 52.0 | 94.0 | NaN | 66.0 | 5 | 208.0 | 31.0 | 24.0 | 161 | 227.0 | 666.0 | 218.0 | 76.0 | 11.0 | 4.0 | 193.0 | 191 | bus |
| 105 | 108 | NaN | 103.0 | 202.0 | 64.0 | 10 | 220.0 | 30.0 | 25.0 | 168 | NaN | 711.0 | 214.0 | 73.0 | 11.0 | NaN | 188.0 | 199 | car |
| 118 | 85 | NaN | NaN | 128.0 | 56.0 | 8 | 150.0 | 46.0 | 19.0 | 144 | 168.0 | 324.0 | 173.0 | 82.0 | 9.0 | 14.0 | 180.0 | 184 | bus |
| 141 | 81 | 42.0 | 63.0 | 125.0 | 55.0 | 8 | 149.0 | 46.0 | 19.0 | 145 | 166.0 | 320.0 | 172.0 | 86.0 | NaN | 7.0 | 179.0 | 182 | bus |
| 159 | 91 | 45.0 | 75.0 | NaN | 57.0 | 6 | 150.0 | 44.0 | 19.0 | 146 | 170.0 | 335.0 | 180.0 | 66.0 | 16.0 | 2.0 | 193.0 | 198 | car |
| 177 | 89 | 44.0 | 72.0 | 160.0 | 66.0 | 7 | 144.0 | 46.0 | 19.0 | 147 | 166.0 | 312.0 | 169.0 | 69.0 | NaN | 1.0 | 191.0 | 198 | bus |
| 192 | 93 | 43.0 | 76.0 | 149.0 | 57.0 | 7 | 149.0 | 44.0 | 19.0 | 143 | 172.0 | 335.0 | 176.0 | NaN | 14.0 | 0.0 | 189.0 | 194 | car |
| 207 | 85 | 42.0 | NaN | 121.0 | 55.0 | 7 | 149.0 | 46.0 | 19.0 | 146 | 167.0 | 323.0 | NaN | 85.0 | 1.0 | 6.0 | 179.0 | 182 | bus |
| 215 | 90 | 39.0 | 86.0 | 169.0 | 62.0 | 7 | 162.0 | NaN | 20.0 | 131 | 194.0 | 388.0 | 147.0 | 74.0 | 1.0 | 22.0 | 185.0 | 191 | car |
| 222 | 100 | 50.0 | 81.0 | 197.0 | NaN | 6 | 186.0 | 34.0 | 22.0 | 158 | 206.0 | 531.0 | 198.0 | 74.0 | NaN | 1.0 | 197.0 | 198 | bus |
| 237 | 85 | 45.0 | 65.0 | 128.0 | 56.0 | 8 | 151.0 | 45.0 | NaN | 145 | 170.0 | 332.0 | 186.0 | 81.0 | 1.0 | 10.0 | 179.0 | 184 | bus |
| 249 | 85 | 34.0 | 53.0 | 127.0 | 58.0 | 6 | NaN | 58.0 | 17.0 | 121 | 137.0 | 197.0 | 127.0 | 70.0 | NaN | 20.0 | 185.0 | 189 | car |
| 266 | 86 | NaN | 65.0 | 116.0 | 53.0 | 6 | 152.0 | 45.0 | 19.0 | 141 | 175.0 | 335.0 | NaN | 85.0 | 5.0 | 4.0 | 179.0 | 183 | bus |
| 273 | 96 | 45.0 | 80.0 | 162.0 | 63.0 | 9 | 146.0 | 46.0 | NaN | 148 | 161.0 | 316.0 | 161.0 | 64.0 | 5.0 | 10.0 | 199.0 | 207 | van |
| 285 | 89 | 48.0 | 85.0 | 189.0 | 64.0 | 8 | 169.0 | 39.0 | 20.0 | 153 | 188.0 | 427.0 | 190.0 | 64.0 | NaN | 5.0 | 195.0 | 201 | car |
| 287 | 88 | 43.0 | 84.0 | NaN | 55.0 | 11 | 154.0 | 44.0 | 19.0 | 150 | 174.0 | 350.0 | 164.0 | 73.0 | 6.0 | 2.0 | 185.0 | 196 | van |
| 308 | 109 | 51.0 | 100.0 | 197.0 | 59.0 | 10 | 192.0 | 34.0 | 22.0 | 161 | 210.0 | NaN | 195.0 | 64.0 | 14.0 | 3.0 | 196.0 | 202 | car |
| 319 | 102 | 51.0 | NaN | 194.0 | 60.0 | 6 | 220.0 | 30.0 | 25.0 | 162 | 247.0 | 731.0 | 209.0 | 80.0 | 7.0 | 7.0 | 188.0 | 186 | bus |
| 329 | 89 | 38.0 | 80.0 | 169.0 | 59.0 | 7 | 161.0 | 41.0 | 20.0 | 131 | 186.0 | 389.0 | 137.0 | NaN | 5.0 | 15.0 | 192.0 | 197 | car |
| 345 | 101 | 54.0 | 106.0 | NaN | 57.0 | 7 | 236.0 | 28.0 | 26.0 | 164 | 256.0 | 833.0 | 253.0 | 81.0 | 6.0 | 14.0 | 185.0 | 185 | bus |
| 372 | 97 | 47.0 | 87.0 | 164.0 | 64.0 | 9 | 156.0 | 43.0 | 20.0 | 149 | NaN | 359.0 | 182.0 | 68.0 | 1.0 | 13.0 | 192.0 | 202 | van |
| 396 | 108 | NaN | 106.0 | 177.0 | 51.0 | 5 | 256.0 | 26.0 | 28.0 | 170 | 285.0 | 966.0 | 261.0 | 87.0 | 11.0 | 2.0 | 182.0 | 181 | bus |
| 419 | 93 | 34.0 | 72.0 | 144.0 | 56.0 | 6 | 133.0 | 50.0 | 18.0 | 123 | 158.0 | 263.0 | 125.0 | 63.0 | 5.0 | 20.0 | NaN | 206 | car |
| 467 | 96 | 54.0 | 104.0 | NaN | 58.0 | 10 | 215.0 | 31.0 | 24.0 | 175 | 221.0 | 682.0 | 222.0 | 75.0 | 13.0 | 23.0 | 186.0 | 194 | car |
| 496 | 106 | 55.0 | 98.0 | 224.0 | 68.0 | 11 | 215.0 | 31.0 | 24.0 | 170 | 222.0 | NaN | 214.0 | 68.0 | 2.0 | 29.0 | 189.0 | 201 | car |
| 522 | 89 | 36.0 | 69.0 | 162.0 | 63.0 | 6 | 140.0 | 48.0 | 18.0 | 131 | NaN | 291.0 | 126.0 | 66.0 | 1.0 | 38.0 | 193.0 | 204 | car |
# 1B... Percentage of missing values per column
percent_missing = Vehicle_df.isnull().sum() * 100 / len(Vehicle_df)
missing_value_df = pd.DataFrame({'column_name': Vehicle_df.columns,
'percent_missing': percent_missing})
missing_value_df
| column_name | percent_missing | |
|---|---|---|
| compactness | compactness | 0.000000 |
| circularity | circularity | 0.591017 |
| distance_circularity | distance_circularity | 0.472813 |
| radius_ratio | radius_ratio | 0.709220 |
| pr.axis_aspect_ratio | pr.axis_aspect_ratio | 0.236407 |
| max.length_aspect_ratio | max.length_aspect_ratio | 0.000000 |
| scatter_ratio | scatter_ratio | 0.118203 |
| elongatedness | elongatedness | 0.118203 |
| pr.axis_rectangularity | pr.axis_rectangularity | 0.354610 |
| max.length_rectangularity | max.length_rectangularity | 0.000000 |
| scaled_variance | scaled_variance | 0.354610 |
| scaled_variance.1 | scaled_variance.1 | 0.236407 |
| scaled_radius_of_gyration | scaled_radius_of_gyration | 0.236407 |
| scaled_radius_of_gyration.1 | scaled_radius_of_gyration.1 | 0.472813 |
| skewness_about | skewness_about | 0.709220 |
| skewness_about.1 | skewness_about.1 | 0.118203 |
| skewness_about.2 | skewness_about.2 | 0.118203 |
| hollows_ratio | hollows_ratio | 0.000000 |
| class | class | 0.000000 |
for i in Vehicle_df.columns:
    if Vehicle_df[i].isnull().sum() > 0:
        median = Vehicle_df[i].median()
        Vehicle_df[i] = Vehicle_df[i].fillna(median)  #..... Imputing with the median (plain assignment avoids chained inplace fillna, deprecated in newer pandas)
Vehicle_df.describe()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 |
| mean | 93.678487 | 44.823877 | 82.100473 | 168.874704 | 61.677305 | 8.567376 | 168.887707 | 40.936170 | 20.580378 | 147.998818 | 188.596927 | 439.314421 | 174.706856 | 72.443262 | 6.361702 | 12.600473 | 188.918440 | 195.632388 |
| std | 8.234474 | 6.134272 | 15.741569 | 33.401356 | 7.882188 | 4.601217 | 33.197710 | 7.811882 | 2.588558 | 14.515652 | 31.360427 | 176.496341 | 32.546277 | 7.468734 | 4.903244 | 8.930962 | 6.152247 | 7.438797 |
| min | 73.000000 | 33.000000 | 40.000000 | 104.000000 | 47.000000 | 2.000000 | 112.000000 | 26.000000 | 17.000000 | 118.000000 | 130.000000 | 184.000000 | 109.000000 | 59.000000 | 0.000000 | 0.000000 | 176.000000 | 181.000000 |
| 25% | 87.000000 | 40.000000 | 70.000000 | 141.000000 | 57.000000 | 7.000000 | 147.000000 | 33.000000 | 19.000000 | 137.000000 | 167.000000 | 318.250000 | 149.000000 | 67.000000 | 2.000000 | 5.000000 | 184.000000 | 190.250000 |
| 50% | 93.000000 | 44.000000 | 80.000000 | 167.000000 | 61.000000 | 8.000000 | 157.000000 | 43.000000 | 20.000000 | 146.000000 | 179.000000 | 363.500000 | 173.500000 | 71.500000 | 6.000000 | 11.000000 | 188.000000 | 197.000000 |
| 75% | 100.000000 | 49.000000 | 98.000000 | 195.000000 | 65.000000 | 10.000000 | 198.000000 | 46.000000 | 23.000000 | 159.000000 | 217.000000 | 586.750000 | 198.000000 | 75.000000 | 9.000000 | 19.000000 | 193.000000 | 201.000000 |
| max | 119.000000 | 59.000000 | 112.000000 | 333.000000 | 138.000000 | 55.000000 | 265.000000 | 61.000000 | 29.000000 | 188.000000 | 320.000000 | 1018.000000 | 268.000000 | 135.000000 | 22.000000 | 41.000000 | 206.000000 | 211.000000 |
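After imputing, it is cheap to verify that no NaNs remain. A self-contained sketch of the same median-imputation idea on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column, standing in for Vehicle_df
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

# Column-wise median imputation in one call
df = df.fillna(df.median(numeric_only=True))

# Sanity check: no NaNs should survive the imputation
remaining = int(df.isnull().sum().sum())
print(remaining)
```

On the notebook data, `Vehicle_df.isnull().sum().sum()` should likewise come back as 0 after the loop above.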
Vehicle_df.groupby('class')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000244E6ABD640>
#1C... Visualize a pie chart for "Class"
Vehicle_df.groupby('class').size().plot(kind='pie', subplots=True, shadow=True, startangle=30, figsize=(8,6), autopct='%1.2f%%')
font1 = {'family':'serif','color':'blue','size':20}
plt.title("Class categories", fontdict = font1)
plt.tight_layout()
plt.show()
from itertools import cycle, islice
my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(Vehicle_df['class'])))
Vehicle_df["class"].value_counts().plot(kind="bar", color=my_colors)  # Series.value_counts avoids the deprecated pd.value_counts
<AxesSubplot:>
#...1D......... checking for duplicated rows
Vehicle_df.duplicated().sum()
0
duplicate = Vehicle_df[Vehicle_df.duplicated()]
duplicate
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|
There are no duplicate rows in the dataset.
#2A.... Splitting data into X & Y
X = Vehicle_df.drop(["class"], axis=1) #........... Independent variables
y = Vehicle_df['class'] #.............. Dependent variable
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=1)
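Since `car` accounts for roughly half the rows, a plain random split can skew the class mix between train and test. Passing `stratify=y` preserves the proportions; a sketch with made-up imbalanced labels:

```python
from sklearn.model_selection import train_test_split

# Imbalanced toy labels, standing in for the car/bus/van column
y = ["car"] * 50 + ["bus"] * 30 + ["van"] * 20
X = [[i] for i in range(100)]

# stratify=y keeps the class proportions identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)

print(sorted(set(y_te)))
```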
X_train
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 135 | 89 | 47.0 | 83.0 | 322.0 | 133.0 | 48 | 158.0 | 43.0 | 20.0 | 163 | 229.0 | 364.0 | 176.0 | 97.0 | 0.0 | 14.0 | 184.0 | 194 |
| 223 | 81 | 44.0 | 72.0 | 139.0 | 60.0 | 6 | 153.0 | 44.0 | 19.0 | 146 | 180.0 | 347.0 | 178.0 | 81.0 | 1.0 | 15.0 | 182.0 | 186 |
| 388 | 94 | 47.0 | 85.0 | 333.0 | 138.0 | 49 | 155.0 | 43.0 | 19.0 | 155 | 320.0 | 354.0 | 187.0 | 135.0 | 12.0 | 9.0 | 188.0 | 196 |
| 134 | 102 | 54.0 | 100.0 | 163.0 | 53.0 | 10 | 213.0 | 31.0 | 24.0 | 173 | 219.0 | 669.0 | 201.0 | 76.0 | 12.0 | 27.0 | 187.0 | 195 |
| 619 | 97 | 55.0 | 96.0 | 170.0 | 54.0 | 10 | 216.0 | 31.0 | 24.0 | 173 | 219.0 | 685.0 | 218.0 | 75.0 | 0.0 | 4.0 | 184.0 | 193 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 715 | 100 | 52.0 | 109.0 | 225.0 | 68.0 | 10 | 222.0 | 30.0 | 25.0 | 165 | 241.0 | 731.0 | 207.0 | 73.0 | 7.0 | 28.0 | 188.0 | 199 |
| 767 | 88 | 39.0 | 70.0 | 166.0 | 66.0 | 7 | 148.0 | 44.0 | 19.0 | 134 | 167.0 | 332.0 | 143.0 | 69.0 | 5.0 | 13.0 | 193.0 | 201 |
| 72 | 92 | 39.0 | 91.0 | 191.0 | 62.0 | 8 | 176.0 | 37.0 | 21.0 | 137 | 196.0 | 466.0 | 151.0 | 67.0 | 3.0 | 23.0 | 192.0 | 200 |
| 235 | 90 | 48.0 | 78.0 | 134.0 | 56.0 | 11 | 160.0 | 43.0 | 20.0 | 167 | 169.0 | 366.0 | 185.0 | 76.0 | 1.0 | 14.0 | 182.0 | 192 |
| 37 | 90 | 48.0 | 86.0 | 306.0 | 126.0 | 49 | 153.0 | 44.0 | 19.0 | 156 | 272.0 | 346.0 | 200.0 | 118.0 | 0.0 | 15.0 | 185.0 | 194 |
676 rows × 18 columns
y_train
135 van
223 bus
388 van
134 car
619 car
...
715 car
767 bus
72 car
235 van
37 van
Name: class, Length: 676, dtype: object
y.unique()
array(['van', 'car', 'bus'], dtype=object)
from scipy.stats import zscore
X_z = X.apply(zscore)  # caveat: z-scoring before the split also uses test-set statistics; fitting a scaler on X_train alone avoids that leakage
X_z
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.160580 | 0.518073 | 0.057177 | 0.273363 | 1.310398 | 0.311542 | -0.207598 | 0.136262 | -0.224342 | 0.758332 | -0.401920 | -0.341934 | 0.285705 | -0.327326 | -0.073812 | 0.380870 | -0.312012 | 0.183957 |
| 1 | -0.325470 | -0.623732 | 0.120741 | -0.835032 | -0.593753 | 0.094079 | -0.599423 | 0.520519 | -0.610886 | -0.344578 | -0.593357 | -0.619724 | -0.513630 | -0.059384 | 0.538390 | 0.156798 | 0.013265 | 0.452977 |
| 2 | 1.254193 | 0.844303 | 1.519141 | 1.202018 | 0.548738 | 0.311542 | 1.148719 | -1.144597 | 0.935290 | 0.689401 | 1.097671 | 1.109379 | 1.392477 | 0.074587 | 1.558727 | -0.403383 | -0.149374 | 0.049447 |
| 3 | -0.082445 | -0.623732 | -0.006386 | -0.295813 | 0.167907 | 0.094079 | -0.750125 | 0.648605 | -0.610886 | -0.344578 | -0.912419 | -0.738777 | -1.466683 | -1.265121 | -0.073812 | -0.291347 | 1.639649 | 1.529056 |
| 4 | -1.054545 | -0.134387 | -0.769150 | 1.082192 | 5.245643 | 9.444962 | -0.599423 | 0.520519 | -0.610886 | -0.275646 | 1.671982 | -0.648070 | 0.408680 | 7.309005 | 0.538390 | -0.179311 | -1.450481 | -1.699181 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 841 | -0.082445 | -0.949961 | 0.311432 | 0.423146 | 0.294851 | -0.123383 | 0.003385 | -0.119910 | -0.224342 | -0.964965 | 0.363829 | -0.098159 | -0.790323 | -0.059384 | 0.130256 | 1.389197 | -0.149374 | -0.085062 |
| 842 | -0.568495 | 0.191843 | 0.120741 | -0.175986 | 0.548738 | 0.529004 | -0.298019 | 0.264347 | -0.224342 | 0.758332 | -0.497638 | -0.404295 | 0.039756 | -0.059384 | -1.094148 | 0.829015 | -0.474650 | 0.183957 |
| 843 | 1.497218 | 1.496763 | 1.201323 | 1.591454 | 0.675681 | 0.746467 | 1.600825 | -1.400769 | 1.708378 | 1.723379 | 1.257202 | 1.596929 | 0.777604 | -0.327326 | -0.686013 | -0.963565 | -0.312012 | 0.721997 |
| 844 | -0.933032 | -1.439306 | -0.260641 | -0.685249 | -0.466810 | -0.340845 | -1.021388 | 1.160948 | -0.997430 | -1.654284 | -1.071950 | -0.959876 | -0.821066 | -0.863208 | -1.298215 | 1.389197 | 0.175903 | -0.085062 |
| 845 | -1.054545 | -1.439306 | -1.023405 | -1.374251 | -0.847640 | -0.775770 | -1.473494 | 1.929463 | -1.383974 | -1.378557 | -1.550542 | -1.288689 | -1.343709 | 0.074587 | -1.094148 | 0.604943 | -0.474650 | -0.757612 |
846 rows × 18 columns
X_train, X_test, y_train, y_test = train_test_split(X_z, y, test_size=.20, random_state=1)
#2B.... Standardizing the data (note: X_z is already z-scored, so this second pass mainly re-centers on the training-fold statistics)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test data
X_train
array([[-0.58186122, 0.342899 , 0.04235553, ..., 0.16001899,
-0.7979766 , -0.21617737],
[-1.55067449, -0.14820211, -0.6498549 , ..., 0.27062517,
-1.12530195, -1.28091983],
[ 0.02364707, 0.342899 , 0.16821198, ..., -0.39301188,
-0.14332589, 0.05000824],
...,
[-0.21855625, -0.96670396, 0.5457813 , ..., 1.15547456,
0.51132481, 0.58237947],
[-0.46075956, 0.50659937, -0.27228557, ..., 0.16001899,
-1.12530195, -0.48236299],
[-0.46075956, 0.50659937, 0.2311402 , ..., 0.27062517,
-0.63431392, -0.21617737]])
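Applying `zscore` to all of `X` before splitting lets test-set statistics influence the scaling, and the `StandardScaler` cell then standardizes data that is already standardized. Wrapping the scaler and the SVM in a pipeline keeps scaling strictly inside the train/test boundary; a sketch on synthetic data (`make_classification` standing in for the vehicle features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 18 vehicle features and 3 classes
X, y = make_classification(n_samples=300, n_features=18, n_informative=6,
                           n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)

# The scaler is fit only on the training fold; score() reuses its statistics
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(round(acc, 3))
```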
#.... 3A........ Train a base classification model using SVM
from sklearn.svm import SVC
clf = SVC(kernel='linear')
# fit the classifier on the training samples and class labels
clf.fit(X_train, y_train)
SVC(kernel='linear')
clf.score(X_train,y_train)
0.9585798816568047
y_pred_svm = clf.predict(X_test)
from sklearn import metrics
from sklearn.metrics import roc_auc_score
cm = metrics.confusion_matrix(y_test, y_pred_svm, labels=['van', 'bus', 'car'])
df_cm = pd.DataFrame(cm, index=["van", "bus", "car"],
                     columns=["Predict van", "Predict bus", "Predict car"])
plt.figure(figsize = (9,7))
sns.heatmap(df_cm, annot=True,cmap="YlGnBu")
<AxesSubplot:>
clf.score(X_test,y_test)
0.9529411764705882
y_pred_svm
array(['van', 'bus', 'bus', 'bus', 'car', 'car', 'van', 'van', 'van',
'van', 'van', 'van', 'bus', 'car', 'bus', 'bus', 'van', 'car',
'car', 'bus', 'car', 'car', 'car', 'car', 'van', 'van', 'bus',
'car', 'car', 'car', 'car', 'van', 'car', 'car', 'van', 'car',
'bus', 'car', 'bus', 'bus', 'car', 'van', 'van', 'car', 'van',
'van', 'car', 'car', 'car', 'car', 'car', 'car', 'bus', 'car',
'car', 'van', 'car', 'car', 'van', 'bus', 'car', 'van', 'car',
'van', 'bus', 'car', 'car', 'car', 'car', 'bus', 'car', 'van',
'car', 'van', 'bus', 'car', 'car', 'car', 'van', 'car', 'car',
'van', 'car', 'car', 'bus', 'car', 'van', 'car', 'bus', 'bus',
'car', 'bus', 'van', 'van', 'van', 'van', 'bus', 'bus', 'bus',
'car', 'car', 'car', 'car', 'van', 'van', 'car', 'car', 'car',
'car', 'car', 'car', 'bus', 'car', 'car', 'car', 'van', 'car',
'van', 'van', 'car', 'car', 'van', 'car', 'car', 'car', 'car',
'van', 'car', 'van', 'car', 'car', 'car', 'car', 'car', 'car',
'bus', 'bus', 'car', 'van', 'bus', 'car', 'van', 'car', 'van',
'bus', 'bus', 'van', 'bus', 'bus', 'car', 'car', 'car', 'car',
'car', 'car', 'car', 'car', 'van', 'bus', 'van', 'car', 'bus',
'car', 'car', 'bus', 'car', 'bus', 'van', 'car', 'bus'],
dtype=object)
#...... 3B....... Printing classification metrics for the test data
print("SVC - Accuracy ",metrics.accuracy_score(y_test, y_pred_svm))
print("SVC - Precision",metrics.precision_score(y_test, y_pred_svm, average="weighted"))
print("SVC - Recall ",metrics.recall_score(y_test, y_pred_svm, average="weighted"))
print("SVC - F1 Score ",metrics.f1_score(y_test, y_pred_svm, average="weighted"))
SVC - Accuracy 0.9529411764705882 SVC - Precision 0.9530154486036839 SVC - Recall 0.9529411764705882 SVC - F1 Score 0.9529087591578909
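The four weighted metrics above can also be obtained in one call, together with the per-class numbers, via `classification_report`. A sketch with toy labels standing in for `y_test` and `y_pred_svm`:

```python
from sklearn.metrics import classification_report

# Toy true/predicted labels, standing in for y_test / y_pred_svm
y_true = ["van", "bus", "car", "car", "van", "bus", "car", "van"]
y_pred = ["van", "bus", "car", "van", "van", "bus", "car", "car"]

# One call prints per-class precision/recall/F1 plus the weighted averages
report = classification_report(y_true, y_pred)
print(report)
```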
Vehicle_df.cov(numeric_only=True) #.... covariance between each pair of numeric columns (numeric_only skips the object "class" column in newer pandas)
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| compactness | 67.806566 | 34.595378 | 102.393288 | 189.708781 | 5.941097 | 5.616954 | 222.142552 | -50.737706 | 17.344216 | 80.818554 | 196.794514 | 1183.048547 | 156.845875 | -15.350216 | 9.531814 | 11.547134 | 15.124042 | 22.391727 |
| circularity | 34.595378 | 37.629299 | 76.508841 | 127.220510 | 7.435406 | 7.097679 | 172.677241 | -39.365101 | 13.392280 | 85.598608 | 153.188097 | 905.058993 | 184.837067 | 2.379936 | 4.337152 | -0.626662 | -3.941009 | 2.115060 |
| distance_circularity | 102.393288 | 76.508841 | 247.796994 | 403.298994 | 19.660863 | 19.171329 | 472.978160 | -112.064585 | 36.388956 | 176.978817 | 425.299717 | 2461.647662 | 361.587476 | -26.564115 | 8.793202 | 37.332497 | 14.149033 | 38.962423 |
| radius_ratio | 189.708781 | 127.220510 | 403.298994 | 1115.650555 | 174.669579 | 69.167032 | 814.370529 | -205.997356 | 61.247953 | 275.850739 | 831.086715 | 4235.335892 | 583.084529 | -45.002975 | 7.977918 | 51.827988 | 78.542431 | 117.104181 |
| pr.axis_aspect_ratio | 5.941097 | 7.435406 | 19.660863 | 174.669579 | 62.128881 | 23.527685 | 27.143602 | -11.270326 | 1.624193 | 14.520328 | 67.460309 | 124.077322 | 31.289907 | 9.004155 | -2.255923 | -2.250972 | 11.632821 | 15.697801 |
| max.length_aspect_ratio | 5.616954 | 7.097679 | 19.171329 | 69.167032 | 23.527685 | 21.171195 | 25.385681 | -6.474984 | 1.923572 | 20.433808 | 46.024231 | 116.335595 | 28.414449 | 10.162999 | 0.351933 | 1.784347 | -0.738285 | 4.925981 |
| scatter_ratio | 222.142552 | 172.677241 | 472.978160 | 814.370529 | 27.143602 | 25.385681 | 1102.087967 | -251.971673 | 85.053415 | 389.886258 | 987.646992 | 5818.327065 | 864.233907 | -6.828864 | 12.119955 | 62.982302 | 1.149410 | 29.342103 |
| elongatedness | -50.737706 | -39.365101 | -112.064585 | -205.997356 | -11.270326 | -6.474984 | -251.971673 | 61.025507 | -19.190130 | -87.977590 | -229.398540 | -1315.091741 | -194.833526 | 6.027143 | -2.014755 | -12.910739 | -5.533023 | -12.604557 |
| pr.axis_rectangularity | 17.344216 | 13.392280 | 36.388956 | 61.247953 | 1.624193 | 1.923572 | 85.053415 | -19.190130 | 6.700632 | 30.470509 | 75.838946 | 451.485940 | 67.119448 | -0.299576 | 1.063200 | 4.963512 | -0.296987 | 1.911832 |
| max.length_rectangularity | 80.818554 | 85.598608 | 176.978817 | 275.850739 | 14.520328 | 20.433808 | 389.886258 | -87.977590 | 30.470509 | 210.704141 | 339.129701 | 2035.770195 | 409.337523 | 4.512359 | 9.669067 | 0.177042 | -9.282937 | 8.289506 |
| scaled_variance | 196.794514 | 153.188097 | 425.299717 | 831.086715 | 67.460309 | 46.024231 | 987.646992 | -229.398540 | 75.838946 | 339.129701 | 983.476393 | 5234.325701 | 795.013063 | 26.485388 | 5.647740 | 54.402084 | 2.743418 | 19.991295 |
| scaled_variance.1 | 1183.048547 | 905.058993 | 2461.647662 | 4235.335892 | 124.077322 | 116.335595 | 5818.327065 | -1315.091741 | 451.485940 | 2035.770195 | 5234.325701 | 31150.958419 | 4566.814765 | -20.301074 | 66.529926 | 316.534052 | 6.752894 | 135.145899 |
| scaled_radius_of_gyration | 156.845875 | 184.837067 | 361.587476 | 583.084529 | 31.289907 | 28.414449 | 864.233907 | -194.833526 | 67.119448 | 409.337523 | 795.013063 | 4566.814765 | 1059.260119 | 46.543111 | 26.567695 | -16.321991 | -44.942280 | -28.568838 |
| scaled_radius_of_gyration.1 | -15.350216 | 2.379936 | -26.564115 | -45.002975 | 9.004155 | 10.162999 | -6.828864 | 6.027143 | -0.299576 | 4.512359 | 26.485388 | -20.301074 | 46.543111 | 55.781984 | -3.235667 | -8.416778 | -34.409958 | -44.564669 |
| skewness_about | 9.531814 | 4.337152 | 8.793202 | 7.977918 | -2.255923 | 0.351933 | 12.119955 | -2.014755 | 1.063200 | 9.669067 | 5.647740 | 66.529926 | 26.567695 | -3.235667 | 24.041798 | -1.532242 | 3.478056 | 3.542591 |
| skewness_about.1 | 11.547134 | -0.626662 | 37.332497 | 51.827988 | -2.250972 | 1.784347 | 62.982302 | -12.910739 | 4.963512 | 0.177042 | 54.402084 | 316.534052 | -16.321991 | -8.416778 | -1.532242 | 79.762083 | 4.247849 | 13.618636 |
| skewness_about.2 | 15.124042 | -3.941009 | 14.149033 | 78.542431 | 11.632821 | -0.738285 | 1.149410 | -5.533023 | -0.296987 | -9.282937 | 2.743418 | 6.752894 | -44.942280 | -34.409958 | 3.478056 | 4.247849 | 37.850145 | 40.849272 |
| hollows_ratio | 22.391727 | 2.115060 | 38.962423 | 117.104181 | 15.697801 | 4.925981 | 29.342103 | -12.604557 | 1.911832 | 8.289506 | 19.991295 | 135.145899 | -28.568838 | -44.564669 | 3.542591 | 13.618636 | 40.849272 | 55.335707 |
# Check for correlation between variables
sns.set(rc={'figure.figsize':(20.7,15.27)})
sns.heatmap(Vehicle_df.corr(method='pearson'),annot=True,cmap="YlGnBu");
sns.set(rc={'figure.figsize':(18.7,12.27)})
sns.set(style="ticks", color_codes=True)
sns.pairplot(Vehicle_df, hue='class');
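As a numeric complement to the heatmap, the strongly correlated pairs can be listed programmatically. A minimal sketch on a small synthetic frame (standing in for Vehicle_df, so it runs on its own; the threshold 0.8 is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Vehicle_df: two features engineered to be strongly
# anti-correlated (as scatter_ratio and elongatedness are) plus an unrelated one.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
corr_df = pd.DataFrame({
    "scatter_ratio": a,
    "elongatedness": -a + rng.normal(scale=0.1, size=200),
    "skewness_about": rng.normal(size=200),
})

corr = corr_df.corr(method="pearson")
# Keep only the upper triangle so each pair appears once, then flatten
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

On the real Vehicle_df the same three lines would flag the cluster of shape features (scatter_ratio, elongatedness, the scaled_variance columns) that dominate the heatmap.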
Vehicle_df.describe()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 |
| mean | 93.678487 | 44.823877 | 82.100473 | 168.874704 | 61.677305 | 8.567376 | 168.887707 | 40.936170 | 20.580378 | 147.998818 | 188.596927 | 439.314421 | 174.706856 | 72.443262 | 6.361702 | 12.600473 | 188.918440 | 195.632388 |
| std | 8.234474 | 6.134272 | 15.741569 | 33.401356 | 7.882188 | 4.601217 | 33.197710 | 7.811882 | 2.588558 | 14.515652 | 31.360427 | 176.496341 | 32.546277 | 7.468734 | 4.903244 | 8.930962 | 6.152247 | 7.438797 |
| min | 73.000000 | 33.000000 | 40.000000 | 104.000000 | 47.000000 | 2.000000 | 112.000000 | 26.000000 | 17.000000 | 118.000000 | 130.000000 | 184.000000 | 109.000000 | 59.000000 | 0.000000 | 0.000000 | 176.000000 | 181.000000 |
| 25% | 87.000000 | 40.000000 | 70.000000 | 141.000000 | 57.000000 | 7.000000 | 147.000000 | 33.000000 | 19.000000 | 137.000000 | 167.000000 | 318.250000 | 149.000000 | 67.000000 | 2.000000 | 5.000000 | 184.000000 | 190.250000 |
| 50% | 93.000000 | 44.000000 | 80.000000 | 167.000000 | 61.000000 | 8.000000 | 157.000000 | 43.000000 | 20.000000 | 146.000000 | 179.000000 | 363.500000 | 173.500000 | 71.500000 | 6.000000 | 11.000000 | 188.000000 | 197.000000 |
| 75% | 100.000000 | 49.000000 | 98.000000 | 195.000000 | 65.000000 | 10.000000 | 198.000000 | 46.000000 | 23.000000 | 159.000000 | 217.000000 | 586.750000 | 198.000000 | 75.000000 | 9.000000 | 19.000000 | 193.000000 | 201.000000 |
| max | 119.000000 | 59.000000 | 112.000000 | 333.000000 | 138.000000 | 55.000000 | 265.000000 | 61.000000 | 29.000000 | 188.000000 | 320.000000 | 1018.000000 | 268.000000 | 135.000000 | 22.000000 | 41.000000 | 206.000000 | 211.000000 |
Vehicle_df.boxplot(figsize=(100,20),patch_artist=True,boxprops=dict(facecolor='r'))
<AxesSubplot:>
sns.set(rc={'figure.figsize':(8,5)})
for column in Vehicle_df:
    if column == 'class':
        continue
    print("column is", column)
    plt.figure()
    Vehicle_df.boxplot([column])
    plt.show()
column is compactness
column is circularity
column is distance_circularity
column is radius_ratio
column is pr.axis_aspect_ratio
column is max.length_aspect_ratio
column is scatter_ratio
column is elongatedness
column is pr.axis_rectangularity
column is max.length_rectangularity
column is scaled_variance
column is scaled_variance.1
column is scaled_radius_of_gyration
column is scaled_radius_of_gyration.1
column is skewness_about
column is skewness_about.1
column is skewness_about.2
column is hollows_ratio
Most of the features plotted above show outliers beyond the 1.5×IQR whiskers of their boxplots
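The visual impression from the boxplots can be quantified with the same 1.5×IQR whisker rule. A sketch with synthetic data standing in for Vehicle_df (the helper name `iqr_outlier_counts` is ours):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one column with injected extreme values, one without.
rng = np.random.default_rng(1)
outlier_df = pd.DataFrame({
    "radius_ratio": np.concatenate([rng.normal(170, 30, 840),
                                    [400, 410, 420, 430, 440, 450]]),
    "circularity": rng.normal(45, 6, 846),
})

def iqr_outlier_counts(frame: pd.DataFrame) -> pd.Series:
    # Count values outside the 1.5*IQR whiskers, per column
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((frame < lower) | (frame > upper)).sum()

print(iqr_outlier_counts(outlier_df))
```

Applied to Vehicle_df (after dropping 'class'), this gives a per-feature outlier count to back up the boxplot reading.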
# Drop class variable
Vehicle_new = Vehicle_df.drop(['class'], axis =1)
Vehicle_new.head()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 |
Vehicle_new=Vehicle_new.apply(zscore)
Vehicle_new.head()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.160580 | 0.518073 | 0.057177 | 0.273363 | 1.310398 | 0.311542 | -0.207598 | 0.136262 | -0.224342 | 0.758332 | -0.401920 | -0.341934 | 0.285705 | -0.327326 | -0.073812 | 0.380870 | -0.312012 | 0.183957 |
| 1 | -0.325470 | -0.623732 | 0.120741 | -0.835032 | -0.593753 | 0.094079 | -0.599423 | 0.520519 | -0.610886 | -0.344578 | -0.593357 | -0.619724 | -0.513630 | -0.059384 | 0.538390 | 0.156798 | 0.013265 | 0.452977 |
| 2 | 1.254193 | 0.844303 | 1.519141 | 1.202018 | 0.548738 | 0.311542 | 1.148719 | -1.144597 | 0.935290 | 0.689401 | 1.097671 | 1.109379 | 1.392477 | 0.074587 | 1.558727 | -0.403383 | -0.149374 | 0.049447 |
| 3 | -0.082445 | -0.623732 | -0.006386 | -0.295813 | 0.167907 | 0.094079 | -0.750125 | 0.648605 | -0.610886 | -0.344578 | -0.912419 | -0.738777 | -1.466683 | -1.265121 | -0.073812 | -0.291347 | 1.639649 | 1.529056 |
| 4 | -1.054545 | -0.134387 | -0.769150 | 1.082192 | 5.245643 | 9.444962 | -0.599423 | 0.520519 | -0.610886 | -0.275646 | 1.671982 | -0.648070 | 0.408680 | 7.309005 | 0.538390 | -0.179311 | -1.450481 | -1.699181 |
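A side note on the scaling step above: `df.apply(zscore)` and scikit-learn's `StandardScaler` compute the same standardization (both divide by the population standard deviation, i.e. ddof=0), which can be verified on a small synthetic frame:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Small synthetic frame in place of the vehicle data
rng = np.random.default_rng(2)
demo_df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

via_zscore = demo_df.apply(zscore).to_numpy()
via_scaler = StandardScaler().fit_transform(demo_df)

print(np.allclose(via_zscore, via_scaler))  # → True
```

The practical difference is that a fitted StandardScaler can later apply the training-set mean and std to the test split, which matters once the data is split.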
#..... 3C....... Applying PCA with 10 components.
# Using scikit-learn's PCA here. It performs all of the above steps and maps the data to the PCA dimensions in one shot
from sklearn.decomposition import PCA
# NOTE - we are generating only 10 PCA dimensions (dimensionality reduction from 18 to 10)
pca = PCA(n_components=10)
pca.fit(Vehicle_new)
PCA(n_components=10)
print(pca.explained_variance_)
[9.40460261 3.01492206 1.90352502 1.17993747 0.91726063 0.53999263 0.35887012 0.22193246 0.1606086 0.09185722]
pca.components_
array([[ 2.75283688e-01, 2.93258469e-01, 3.04609128e-01,
2.67606877e-01, 8.05039890e-02, 9.72756855e-02,
3.17092750e-01, -3.14133155e-01, 3.13959064e-01,
2.82830900e-01, 3.09280359e-01, 3.13788457e-01,
2.72047492e-01, -2.08137692e-02, 4.14555082e-02,
5.82250207e-02, 3.02795063e-02, 7.41453913e-02],
[-1.26953763e-01, 1.25576727e-01, -7.29516436e-02,
-1.89634378e-01, -1.22174860e-01, 1.07482875e-02,
4.81181371e-02, 1.27498515e-02, 5.99352482e-02,
1.16220532e-01, 6.22806229e-02, 5.37843596e-02,
2.09233172e-01, 4.88525148e-01, -5.50899716e-02,
-1.24085090e-01, -5.40914775e-01, -5.40354258e-01],
[-1.19922479e-01, -2.48205467e-02, -5.60143254e-02,
2.75074211e-01, 6.42012966e-01, 5.91801304e-01,
-9.76283108e-02, 5.76484384e-02, -1.09512416e-01,
-1.70641987e-02, 5.63239801e-02, -1.08840729e-01,
-3.14636493e-02, 2.86277015e-01, -1.15679354e-01,
-7.52828901e-02, 8.73592034e-03, 3.95242743e-02],
[ 7.83843562e-02, 1.87337408e-01, -7.12008427e-02,
-4.26053415e-02, 3.27257119e-02, 3.14147277e-02,
-9.57485748e-02, 8.22901952e-02, -9.24582989e-02,
1.88005612e-01, -1.19844008e-01, -9.17449325e-02,
2.00095228e-01, -6.55051354e-02, 6.04794251e-01,
-6.66114117e-01, 1.05526253e-01, 4.74890311e-02],
[ 6.95178336e-02, -8.50649539e-02, 4.06645651e-02,
-4.61473714e-02, -4.05494487e-02, 2.13432566e-01,
-1.54853055e-02, 7.68518712e-02, 2.17633157e-03,
-6.06366845e-02, -4.56472367e-04, -1.95548315e-02,
-6.15991681e-02, 1.45530146e-01, 7.29189842e-01,
5.99196401e-01, -1.00602332e-01, -2.98614819e-02],
[ 1.44875476e-01, -3.02731148e-01, -1.38405773e-01,
2.48136636e-01, 2.36932611e-01, -4.19330747e-01,
1.16100153e-01, -1.41840112e-01, 9.80561329e-02,
-4.61674972e-01, 2.36225434e-01, 1.57820194e-01,
-1.35576278e-01, 2.41356821e-01, 2.03209257e-01,
-1.91960802e-01, 1.56939174e-01, -2.41222817e-01],
[ 4.51862331e-01, -2.49103387e-01, 7.40350569e-02,
-1.76912814e-01, -3.97876601e-01, 5.03413610e-01,
6.49879382e-02, 1.38112945e-02, 9.66573058e-02,
-1.04552173e-01, 1.14622578e-01, 8.37350220e-02,
-3.73944382e-01, 1.11952983e-01, -8.06328902e-02,
-2.84558723e-01, 1.81451818e-02, 1.57237839e-02],
[-5.66136785e-01, -1.79851809e-01, 4.34748988e-01,
1.01998360e-01, -6.87147927e-02, 1.61153097e-01,
1.00688056e-01, -2.15497166e-01, 6.35933915e-02,
-2.49495867e-01, 5.02096319e-02, 4.37649907e-02,
-1.08474496e-01, -3.40878491e-01, 1.56487670e-01,
-2.08774083e-01, -3.04580219e-01, -3.04186304e-02],
[-4.84418105e-01, -1.41569001e-02, -1.67572478e-01,
-2.30313563e-01, -2.77128307e-01, 1.48032250e-01,
5.44574214e-02, -1.56867362e-01, 5.24978759e-03,
-6.10362445e-02, 2.97588112e-01, 8.33669838e-02,
2.41655483e-01, 3.20221887e-01, 2.21054148e-02,
1.01761758e-02, 5.17222779e-01, 1.71506343e-01],
[-2.60076393e-01, 9.80779086e-02, -2.05031597e-01,
-4.77888949e-02, 1.08075009e-01, -1.18266345e-01,
1.65167200e-01, -1.51612333e-01, 1.93777917e-01,
4.69059999e-01, -1.29986011e-01, 1.58203940e-01,
-6.86493700e-01, 1.27648385e-01, 9.83643219e-02,
-3.55150608e-02, 1.93956186e-02, 6.41314778e-02]])
print(pca.explained_variance_ratio_)
[0.52186034 0.16729768 0.10562639 0.0654746 0.05089869 0.02996413 0.01991366 0.01231501 0.00891215 0.00509715]
Xpca10 = pca.transform(Vehicle_new)
Xpca10
array([[ 3.34162030e-01, -2.19026358e-01, 1.00158417e+00, ...,
-3.81106357e-01, -8.66309530e-01, 9.15114442e-02],
[-1.59171085e+00, -4.20602982e-01, -3.69033854e-01, ...,
2.47058909e-01, 1.47249715e-01, -9.37944293e-02],
[ 3.76932418e+00, 1.95282752e-01, 8.78587404e-02, ...,
4.82771767e-01, -3.10832555e-01, -4.67615341e-01],
...,
[ 4.80917387e+00, -1.24931049e-03, 5.32333105e-01, ...,
1.10477865e-01, -6.52536352e-01, 5.56591558e-01],
[-3.29409242e+00, -1.00827615e+00, -3.57003198e-01, ...,
3.20621635e-01, -2.01263247e-01, -8.74536682e-01],
[-4.76505347e+00, 3.34899728e-01, -5.68136078e-01, ...,
-2.48034955e-01, -4.29903644e-01, -2.99232676e-01]])
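For intuition about what `pca.transform` produced here: it centers the data at `pca.mean_` and projects it onto the rows of `pca.components_`. A quick self-contained check on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# transform(X) == (X - mean_) @ components_.T  when whiten=False (the default)
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(100, 6))

pca_demo = PCA(n_components=3).fit(X_demo)
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(manual, pca_demo.transform(X_demo)))  # → True
```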
sns.pairplot(pd.DataFrame(Xpca10))
<seaborn.axisgrid.PairGrid at 0x244e3a85880>
# y (the 'class' labels) and scaler (a StandardScaler instance) are assumed to have been created in earlier cells
from sklearn.svm import SVC
from sklearn import metrics
X_train_1a, X_test_1a, y_train_1a, y_test_1a = train_test_split(Xpca10, y, test_size=.20, random_state=1)
X_train_1a = scaler.fit_transform(X_train_1a)
X_test_1a = scaler.transform(X_test_1a)
clf10 = SVC(kernel='linear')
# fitting x samples and y classes
clf10.fit(X_train_1a, y_train_1a)
print("Train score with PCA 10 features",clf10.score(X_train_1a,y_train_1a))
y_pred_svm_1a = clf10.predict(X_test_1a)
print("Test score with PCA 10 features",clf10.score(X_test_1a,y_test_1a))
print("Predicted Y with PCA 10 features",y_pred_svm_1a)
print("SVC with PCA 10 features - Accuracy ",metrics.accuracy_score(y_test_1a, y_pred_svm_1a))
print("SVC with PCA 10 features - Precision",metrics.precision_score(y_test_1a, y_pred_svm_1a, average="weighted"))
print("SVC with PCA 10 features - Recall ",metrics.recall_score(y_test_1a, y_pred_svm_1a, average="weighted"))
print("SVC with PCA 10 features - F1 Score ",metrics.f1_score(y_test_1a, y_pred_svm_1a, average="weighted"))
Train score with PCA 10 features 0.9186390532544378
Test score with PCA 10 features 0.9176470588235294
Predicted Y with PCA 10 features ['van' 'bus' 'bus' 'bus' 'car' 'car' 'van' 'van' 'van' 'van' 'van' 'van' 'car' 'bus' 'bus' 'car' 'van' 'car' 'van' 'bus' 'car' 'car' 'car' 'car' 'van' 'van' 'bus' 'car' 'car' 'car' 'car' 'van' 'car' 'car' 'van' 'car' 'bus' 'car' 'bus' 'bus' 'car' 'van' 'van' 'car' 'van' 'van' 'car' 'bus' 'car' 'car' 'car' 'car' 'bus' 'car' 'car' 'van' 'car' 'car' 'van' 'bus' 'car' 'van' 'car' 'van' 'bus' 'car' 'car' 'car' 'car' 'bus' 'car' 'van' 'car' 'van' 'bus' 'car' 'bus' 'car' 'van' 'car' 'car' 'van' 'car' 'car' 'bus' 'car' 'van' 'car' 'bus' 'bus' 'car' 'bus' 'van' 'bus' 'van' 'van' 'bus' 'bus' 'bus' 'car' 'car' 'car' 'car' 'van' 'van' 'car' 'car' 'car' 'car' 'car' 'car' 'bus' 'car' 'car' 'car' 'van' 'car' 'van' 'van' 'car' 'car' 'van' 'car' 'car' 'bus' 'car' 'van' 'car' 'van' 'car' 'car' 'car' 'car' 'car' 'car' 'bus' 'bus' 'car' 'car' 'bus' 'car' 'van' 'car' 'van' 'bus' 'bus' 'bus' 'car' 'bus' 'car' 'car' 'car' 'car' 'car' 'car' 'car' 'car' 'van' 'bus' 'van' 'car' 'bus' 'bus' 'car' 'bus' 'car' 'bus' 'van' 'car' 'bus']
SVC with PCA 10 features - Accuracy 0.9176470588235294
SVC with PCA 10 features - Precision 0.9197930990578049
SVC with PCA 10 features - Recall 0.9176470588235294
SVC with PCA 10 features - F1 Score 0.9183410998239349
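The weighted averages above can hide class-level errors; a per-class confusion matrix and classification report make them visible. A sketch with short hypothetical label lists in place of y_test_1a / y_pred_svm_1a:

```python
from sklearn import metrics

# Hypothetical true/predicted labels, just to show the per-class breakdown
y_true = ["van", "bus", "car", "car", "van", "bus", "car", "van"]
y_pred = ["van", "bus", "car", "van", "van", "car", "car", "van"]

labels = ["bus", "car", "van"]
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
print(metrics.classification_report(y_true, y_pred, labels=labels))
```

On the real predictions, the off-diagonal cells would show which vehicle classes the SVC confuses most.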
list(range(1,11))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.bar(list(range(1,11)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()
#....3D........ Visualizing cumulative variance explained and drawing a horizontal line at 90%
plt.step(list(range(1,11)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variation explained')
plt.xlabel('Principal component')
plt.axhline(y = 0.9, color = 'r', linestyle = '-')
plt.show()
# Dimensionality Reduction
# Retaining 5 components looks very reasonable: the first 5 components explain over 90% of the variation in the original data
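The same cut-off can also be delegated to PCA itself: passing a float as n_components keeps just enough components to reach that variance fraction. A sketch on synthetic correlated data (standing in for Vehicle_new):

```python
import numpy as np
from sklearn.decomposition import PCA

# 12 observed features driven by 4 latent factors plus a little noise
rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 12)) + 0.05 * rng.normal(size=(300, 12))

pca_auto = PCA(n_components=0.90)  # keep components until >= 90% variance is explained
pca_auto.fit(X_demo)
print(pca_auto.n_components_, np.cumsum(pca_auto.explained_variance_ratio_))
```

On Vehicle_new, `PCA(n_components=0.90)` would land on 5 components, matching the read-off from the cumulative plot above.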
#..... 3F......... Applying PCA on 5 components
pca5 = PCA(n_components=5)
pca5.fit(Vehicle_new)
print(pca5.components_)
print(pca5.explained_variance_ratio_)
Xpca5 = pca5.transform(Vehicle_new)
[[ 2.75283688e-01  2.93258469e-01  3.04609128e-01  2.67606877e-01  8.05039890e-02  9.72756855e-02  3.17092750e-01 -3.14133155e-01  3.13959064e-01  2.82830900e-01  3.09280359e-01  3.13788457e-01  2.72047492e-01 -2.08137692e-02  4.14555082e-02  5.82250207e-02  3.02795063e-02  7.41453913e-02]
 [-1.26953763e-01  1.25576727e-01 -7.29516436e-02 -1.89634378e-01 -1.22174860e-01  1.07482875e-02  4.81181371e-02  1.27498515e-02  5.99352482e-02  1.16220532e-01  6.22806229e-02  5.37843596e-02  2.09233172e-01  4.88525148e-01 -5.50899716e-02 -1.24085090e-01 -5.40914775e-01 -5.40354258e-01]
 [-1.19922479e-01 -2.48205467e-02 -5.60143254e-02  2.75074210e-01  6.42012966e-01  5.91801304e-01 -9.76283108e-02  5.76484384e-02 -1.09512416e-01 -1.70641987e-02  5.63239801e-02 -1.08840729e-01 -3.14636493e-02  2.86277015e-01 -1.15679354e-01 -7.52828901e-02  8.73592033e-03  3.95242743e-02]
 [ 7.83843562e-02  1.87337408e-01 -7.12008427e-02 -4.26053417e-02  3.27257121e-02  3.14147276e-02 -9.57485748e-02  8.22901953e-02 -9.24582991e-02  1.88005612e-01 -1.19844007e-01 -9.17449325e-02  2.00095228e-01 -6.55051356e-02  6.04794251e-01 -6.66114117e-01  1.05526253e-01  4.74890312e-02]
 [ 6.95178336e-02 -8.50649539e-02  4.06645651e-02 -4.61473716e-02 -4.05494486e-02  2.13432566e-01 -1.54853055e-02  7.68518712e-02  2.17633151e-03 -6.06366845e-02 -4.56471929e-04 -1.95548316e-02 -6.15991682e-02  1.45530146e-01  7.29189842e-01  5.99196401e-01 -1.00602332e-01 -2.98614818e-02]]
[0.52186034 0.16729768 0.10562639 0.0654746 0.05089869]
Xpca5
array([[ 3.34162030e-01, -2.19026358e-01, 1.00158417e+00,
1.76612370e-01, 7.93007081e-02],
[-1.59171085e+00, -4.20602982e-01, -3.69033854e-01,
2.33234117e-01, 6.93948582e-01],
[ 3.76932418e+00, 1.95282752e-01, 8.78587404e-02,
1.20221219e+00, 7.31732265e-01],
...,
[ 4.80917387e+00, -1.24931049e-03, 5.32333105e-01,
2.95652324e-01, -1.34423635e+00],
[-3.29409242e+00, -1.00827615e+00, -3.57003198e-01,
-1.93367514e+00, 4.27680050e-02],
[-4.76505347e+00, 3.34899728e-01, -5.68136078e-01,
-1.22480708e+00, -5.40510368e-02]])
sns.pairplot(pd.DataFrame(Xpca5))
<seaborn.axisgrid.PairGrid at 0x244eaf9c6a0>
Xpca5.shape
(846, 5)
y.shape
(846,)
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(Xpca5, y, test_size=.20, random_state=1)
#... 3G........ Training SVC with PCA 5 components
X_train_1 = scaler.fit_transform(X_train_1)
X_test_1 = scaler.transform(X_test_1)
clf = SVC(kernel='linear')
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
print("Train score with PCA 5 features",clf.score(X_train_1,y_train_1))
y_pred_svm_1 = clf.predict(X_test_1)
print("Test score with PCA 5 features",clf.score(X_test_1,y_test_1))
print("Predicted Y with PCA 5 features",y_pred_svm_1)
print("SVC with PCA 5 features - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm_1))
print("SVC with PCA 5 features - Precision",metrics.precision_score(y_test_1, y_pred_svm_1, average="weighted"))
print("SVC with PCA 5 features - Recall ",metrics.recall_score(y_test_1, y_pred_svm_1, average="weighted"))
print("SVC with PCA 5 features - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm_1, average="weighted"))
Train score with PCA 5 features 0.6878698224852071
Test score with PCA 5 features 0.6705882352941176
Predicted Y with PCA 5 features ['bus' 'car' 'car' 'car' 'car' 'car' 'car' 'van' 'bus' 'car' 'van' 'van' 'bus' 'bus' 'van' 'car' 'car' 'bus' 'van' 'bus' 'car' 'car' 'car' 'car' 'van' 'van' 'car' 'car' 'car' 'car' 'van' 'car' 'car' 'car' 'van' 'car' 'car' 'car' 'car' 'bus' 'car' 'bus' 'van' 'car' 'bus' 'van' 'bus' 'bus' 'car' 'car' 'car' 'van' 'car' 'car' 'car' 'bus' 'car' 'car' 'bus' 'bus' 'car' 'van' 'car' 'van' 'bus' 'car' 'car' 'car' 'car' 'bus' 'car' 'car' 'car' 'van' 'bus' 'van' 'car' 'van' 'van' 'car' 'bus' 'bus' 'car' 'car' 'bus' 'car' 'van' 'car' 'bus' 'bus' 'car' 'van' 'bus' 'van' 'van' 'van' 'bus' 'van' 'car' 'car' 'car' 'car' 'van' 'van' 'bus' 'car' 'car' 'car' 'car' 'car' 'car' 'car' 'car' 'car' 'car' 'van' 'car' 'van' 'bus' 'car' 'car' 'car' 'car' 'van' 'van' 'car' 'van' 'car' 'van' 'van' 'car' 'car' 'car' 'car' 'car' 'bus' 'bus' 'car' 'bus' 'bus' 'car' 'car' 'car' 'car' 'car' 'car' 'bus' 'van' 'bus' 'car' 'car' 'car' 'car' 'car' 'bus' 'car' 'car' 'van' 'bus' 'car' 'car' 'bus' 'van' 'car' 'bus' 'car' 'bus' 'car' 'bus' 'van']
SVC with PCA 5 features - Accuracy 0.6705882352941176
SVC with PCA 5 features - Precision 0.6661869299640197
SVC with PCA 5 features - Recall 0.6705882352941176
SVC with PCA 5 features - F1 Score 0.6672787493375729
It is observed that after reducing the data to 5 PCA components, both the train and test scores dropped:
Train score with PCA 5 features 0.6878698224852071 Test score with PCA 5 features 0.6705882352941176
The first SVC, trained on all 18 original features, had given the following scores:
Training set without PCA with 18 original features - 0.9585798816568047 Testing set without PCA with 18 original features - 0.9529411764705882
So even though the accuracy scores dropped after applying PCA with 5 components (Training - 0.6878 & Testing - 0.6705), those 5 components still retain over 90% of the variance of the original data
#...4A... Tuning SVC with hyper parameters by changing values of kernel, C & gamma
clf = SVC(kernel='rbf',C = 1.0,gamma='auto')
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Accuracy 0.7647058823529411
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Precision 0.7680456095481671
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Recall 0.7647058823529411
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - F1 Score 0.7641675269418904
clf = SVC(kernel='rbf',C = 1.0,gamma=0.1)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Accuracy 0.7352941176470589
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Precision 0.7350656051894442
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Recall 0.7352941176470589
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - F1 Score 0.7351326412918109
clf = SVC(kernel='rbf',C = 10.0,gamma=0.1)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Accuracy 0.8
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Precision 0.8013901493276132
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Recall 0.8
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - F1 Score 0.7991144889581198
clf = SVC(kernel='rbf',C = 100.0,gamma=0.1)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Accuracy 0.8058823529411765
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Precision 0.81821432975706
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Recall 0.8058823529411765
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - F1 Score 0.8058823529411765
clf = SVC(kernel='rbf',C = 100.0,gamma=0.01)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Accuracy 0.7411764705882353
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Precision 0.7378605259375339
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Recall 0.7411764705882353
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - F1 Score 0.7390252300624871
clf = SVC(kernel='rbf',C = 10000.0,gamma=0.01)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Accuracy 0.8235294117647058
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Precision 0.8288524500709374
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Recall 0.8235294117647058
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - F1 Score 0.8247829781643972
clf = SVC(kernel='linear',C = 0.1)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features linear, C=0.1 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features linear, C=0.1 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=0.1 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=0.1 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features linear, C=0.1 - Accuracy 0.6647058823529411
SVC Tuning with PCA 5 features linear, C=0.1 - Precision 0.6577564359076965
SVC Tuning with PCA 5 features linear, C=0.1 - Recall 0.6647058823529411
SVC Tuning with PCA 5 features linear, C=0.1 - F1 Score 0.6593503497197056
clf = SVC(kernel='linear',C = 1)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features linear, C=1 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features linear, C=1 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=1 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=1 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features linear, C=1 - Accuracy 0.6705882352941176
SVC Tuning with PCA 5 features linear, C=1 - Precision 0.6661869299640197
SVC Tuning with PCA 5 features linear, C=1 - Recall 0.6705882352941176
SVC Tuning with PCA 5 features linear, C=1 - F1 Score 0.6672787493375729
clf = SVC(kernel='linear',C = 10)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features linear, C=10 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features linear, C=10 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=10 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=10 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features linear, C=10 - Accuracy 0.6647058823529411
SVC Tuning with PCA 5 features linear, C=10 - Precision 0.6593506819513012
SVC Tuning with PCA 5 features linear, C=10 - Recall 0.6647058823529411
SVC Tuning with PCA 5 features linear, C=10 - F1 Score 0.6609552199258082
clf = SVC(kernel='linear',C = 100)
# fitting x samples and y classes
clf.fit(X_train_1, y_train_1)
y_pred_svm = clf.predict(X_test_1)
print("SVC Tuning with PCA 5 features linear, C=100 - Accuracy ",metrics.accuracy_score(y_test_1, y_pred_svm))
print("SVC Tuning with PCA 5 features linear, C=100 - Precision",metrics.precision_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=100 - Recall   ",metrics.recall_score(y_test_1, y_pred_svm, average="weighted"))
print("SVC Tuning with PCA 5 features linear, C=100 - F1 Score ",metrics.f1_score(y_test_1, y_pred_svm, average="weighted"))
SVC Tuning with PCA 5 features linear, C=100 - Accuracy 0.6647058823529411
SVC Tuning with PCA 5 features linear, C=100 - Precision 0.6593506819513012
SVC Tuning with PCA 5 features linear, C=100 - Recall 0.6647058823529411
SVC Tuning with PCA 5 features linear, C=100 - F1 Score 0.6609552199258082
#... 4B.... Best parameters observed
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Accuracy 0.8235294117647058
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Precision 0.8288524500709374
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Recall 0.8235294117647058
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - F1 Score 0.8247829781643972
#.... 4C.... Printing classification metrics and sharing insights across all models
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Accuracy 0.7647058823529411
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Precision 0.7680456095481671
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - Recall 0.7647058823529411
SVC Tuning with PCA 5 features & rbf, C=1, gamma='auto' - F1 Score 0.7641675269418904
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Accuracy 0.7352941176470589
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Precision 0.7350656051894442
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - Recall 0.7352941176470589
SVC Tuning with PCA 5 features rbf, C=1, gamma=0.1 - F1 Score 0.7351326412918109
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Accuracy 0.8
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Precision 0.8013901493276132
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - Recall 0.8
SVC Tuning with PCA 5 features rbf, C=10, gamma=0.1 - F1 Score 0.7991144889581198
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Accuracy 0.8058823529411765
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Precision 0.81821432975706
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - Recall 0.8058823529411765
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.1 - F1 Score 0.8058823529411765
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Accuracy 0.7411764705882353
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Precision 0.7378605259375339
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - Recall 0.7411764705882353
SVC Tuning with PCA 5 features rbf, C=100, gamma=0.01 - F1 Score 0.7390252300624871
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Accuracy 0.8235294117647058
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Precision 0.8288524500709374
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - Recall 0.8235294117647058
SVC Tuning with PCA 5 features rbf, C=10000, gamma=0.01 - F1 Score 0.8247829781643972
It is observed that kernel='rbf' gives better scores than 'linear'.
Increasing C (the penalty parameter of the error term) with the 'rbf' kernel improves model performance:
raising C from 1 to 10 lifted accuracy from 0.73 to 0.80,
and increasing it further through 100, 1000 and finally 10000 lifted it from 0.80 to 0.82.
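The manual sweep above can be condensed into a single loop; a minimal sketch, using a synthetic dataset as a stand-in for the PCA-reduced splits (`X_train_1`, `y_train_1`, etc. come from earlier cells and are not reproduced here):

```python
# Sketch: evaluate an RBF-kernel SVC over increasing C values, as in the
# manual tuning above. Synthetic data stands in for the PCA features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics

X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scores = {}
for C in [1, 10, 100, 10000]:
    clf = SVC(kernel="rbf", C=C, gamma=0.01)
    clf.fit(X_tr, y_tr)
    scores[C] = metrics.accuracy_score(y_te, clf.predict(X_te))
print(scores)
```

The exact numbers depend on the data; the point is that one loop replaces the repeated copy-pasted cells.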
#..... Using GridSearchCV to figure out the best hyper parameter tuning
from sklearn.model_selection import GridSearchCV
svc= SVC()
parameters = [{"kernel": ["rbf"],"gamma":[0.1,0.01],"C":[0.1,1,10,100,1000,10000]},
{"kernel": ["linear"],"C": [0.1,1,10,100,1000,10000]}]
cv = GridSearchCV(svc,parameters,cv=5)
# fit the grid search on the training split; tuning on the test set leaks information
cv.fit(X_train_1,y_train_1)
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    # avoid rebinding `params` inside the loop it is iterating over
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean,2)} + or -{round(std,2)} for the {param}')
display(cv)
display(cv)
Best parameters are: {'C': 10000, 'gamma': 0.01, 'kernel': 'rbf'}
0.53 + or -0.0 for the {'C': 0.1, 'gamma': 0.1, 'kernel': 'rbf'}
0.53 + or -0.0 for the {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
0.62 + or -0.05 for the {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.53 + or -0.0 for the {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.68 + or -0.06 for the {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.62 + or -0.06 for the {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.71 + or -0.07 for the {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
0.6 + or -0.06 for the {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.64 + or -0.08 for the {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
0.68 + or -0.07 for the {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
0.65 + or -0.06 for the {'C': 10000, 'gamma': 0.1, 'kernel': 'rbf'}
0.71 + or -0.1 for the {'C': 10000, 'gamma': 0.01, 'kernel': 'rbf'}
0.57 + or -0.03 for the {'C': 0.1, 'kernel': 'linear'}
0.6 + or -0.07 for the {'C': 1, 'kernel': 'linear'}
0.6 + or -0.08 for the {'C': 10, 'kernel': 'linear'}
0.6 + or -0.08 for the {'C': 100, 'kernel': 'linear'}
0.6 + or -0.08 for the {'C': 1000, 'kernel': 'linear'}
0.61 + or -0.07 for the {'C': 10000, 'kernel': 'linear'}
Pre-requisites/Assumptions of Principal Component Analysis
PCA assumes that principal components with high variance carry the signal worth keeping, while components with low variance can be disregarded as noise. PCA originated in the Pearson correlation framework, where the axes of highest variance were taken to be the principal components.
All variables should be measured at the same (ratio) level of measurement. A common rule of thumb is at least 150 observations, with a cases-to-variables ratio of at least 5:1.
Outliers, i.e. extreme values that deviate from the other data points, should be few. A large number of outliers suggests experimental error and will degrade the ML model/algorithm.
Sampling adequacy is required: PCA needs a large enough sample size to produce a reliable result.
The data should be suitable for data reduction. Effectively, there must be adequate correlations between the variables for them to be condensed into a smaller number of components.
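The last point can be checked with a correlation matrix before running PCA; a minimal sketch on synthetic data (the notebook's own attribute dataframe, e.g. `Car_attr_df`, could be substituted, and the column names here are only illustrative):

```python
# Sketch: verify variables are correlated enough to be worth reducing.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=200)
df = pd.DataFrame({
    "disp": base + rng.normal(scale=0.2, size=200),  # strongly related pair
    "hp":   base + rng.normal(scale=0.3, size=200),
    "acc":  rng.normal(size=200),                    # mostly independent
})
corr = df.corr()
print(corr.round(2))
```

Pairs with high absolute correlation (here `disp` and `hp`) are the ones PCA can fold into a shared component.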
Advantages of PCA
a) Removes multicollinearity - Correlation between independent variables hurts machine-learning performance. PCA combines highly correlated variables into a set of uncorrelated, orthogonal principal components, effectively eliminating multicollinearity between features.
b) Decreases computational time - By reducing a large number of independent variables to a smaller set of uncorrelated components, PCA shortens both training and testing of the ML model.
c) Helps to reduce overfitting - Overfitting often occurs when a dataset has too many variables, so PCA helps mitigate it by reducing the number of features.
d) Improves visualization - PCA transforms a high-dimensional dataset to low-dimensional (e.g. 2-D) data that can be visualized easily. A 2D scree plot shows which principal components capture high variance and therefore have more impact than the others.
e) Improves algorithm performance - With too many features, algorithm performance degrades drastically. PCA speeds up the machine-learning algorithm by discarding correlated variables that contribute little to the decision, and training time drops significantly with fewer features. So if the input dimensionality is very high, using PCA to speed up the algorithm is a reasonable choice.
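The variance captured by each component, the quantity behind a scree plot, is exposed by scikit-learn as `explained_variance_ratio_`; a minimal sketch on synthetic standardized data (the notebook's numeric attributes could be substituted):

```python
# Sketch: explained-variance ratios behind a scree plot.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated pair

X_std = StandardScaler().fit_transform(X)  # standardize before PCA
pca = PCA().fit(X_std)
ratios = pca.explained_variance_ratio_
print(ratios.round(3))  # sorted in decreasing order, sums to 1
```

Plotting `ratios` against the component index gives the scree plot; the "elbow" suggests how many components to keep.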
Limitations / Disadvantages of PCA
a) Data normalization is a must before applying PCA - The dataset must be standardized before PCA is applied; otherwise features on large scales dominate the variance and the optimal principal components are hard to find.
b) Results in some level of information loss - Dimensionality reduction (and the handling of outliers) inevitably discards some information. Principal components try to cover the maximum variance among the features, but if the number of retained components is not chosen carefully, some information from the original feature set is lost.
c) Independent variables become less interpretable - After PCA, the original features are replaced by principal components, which are linear combinations of the original features and are not as readable or interpretable.
d) It is not suitable for small datasets - With few observations, the estimated covariance structure, and hence the components, is unreliable.
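Limitation (c) can be made concrete: each transformed column is exactly a weighted sum of the centered original features, with the weights stored in `components_`. A minimal sketch on synthetic data:

```python
# Sketch: principal components are linear combinations of the originals,
# so the transformed columns no longer map to single named features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
pca = PCA(n_components=2).fit(X)

Z = pca.transform(X)
# same projection computed by hand from the component weight vectors
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))  # True
```

Inspecting the rows of `pca.components_` (the loadings) is the usual way to recover some interpretability after the transform.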